Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.

Justin Bleich is a Senior Data Scientist at Coatue Management. Prior to Coatue, Justin was the co-founder and CTO of Zodiac, an artificial intelligence startup that focused on predicting customer behavior to help brands retain their best customers and find more like them. Additionally, Justin was an Adjunct Professor at The Wharton School at the University of Pennsylvania where he taught advanced data mining and predictive modeling. Justin received his PhD in Statistics from The Wharton School where he focused on Bayesian machine learning and ensemble-of-trees algorithms.

Presentations

Probabilistic programming in finance using Prophet Session

Coatue is a hedge fund that uses data science to drive investment decisions. Prophet, recently released by Facebook, is a Bayesian nonlinear time series forecasting model. We extend Prophet to include exogenous covariates when generating forecasts and apply our modification to the task of nowcasting macroeconomic series using higher-frequency data available from sources such as Google Trends.
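
As a rough illustration of the approach (not necessarily the speakers' own implementation), recent Prophet releases expose an add_regressor hook for exogenous covariates; the file and column names below are hypothetical:

from fbprophet import Prophet
import pandas as pd

# Hypothetical frame: ds (date), y (target macro series), trends_index (Google Trends covariate)
df = pd.read_csv("macro_series.csv")

m = Prophet()
m.add_regressor("trends_index")  # exogenous covariate alongside Prophet's trend and seasonality
m.fit(df)

# For nowcasting, predict over dates where the higher-frequency covariate is already observed
forecast = m.predict(df[["ds", "trends_index"]])
print(forecast[["ds", "yhat"]].tail())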

Ashvin Agrawal is a senior research engineer at Microsoft, where he works on streaming systems and contributes to the Twitter Heron project. A software engineer with more than 10 years of experience, Ashvin specializes in developing large-scale distributed systems. Previously, he worked at VMware, Yahoo, and Mojo Networks. Ashvin holds an MTech in computer science from IIT Kanpur, India.

Presentations

Modern Real Time Streaming Architectures Tutorial

Across diverse segments of industry, there has been a shift in focus from big data to fast data. This stems in part from the deluge of high-velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through state-of-the-art streaming systems, algorithms, and deployment architectures.

Manish Ahluwalia is a software engineer at Cloudera, where he focuses on security of the Hadoop ecosystem. Manish has been working in big data since its infancy in various companies in Silicon Valley. He is most passionate about security.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Tyler Akidau is a staff software engineer at Google Seattle. He leads technical infrastructure’s internal data processing teams (MillWheel & Flume), is a founding member of the Apache Beam PMC, and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 Dataflow Model paper and the Streaming 101 and Streaming 102 articles on the O’Reilly website. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Foundations of Streaming SQL or: how I learned to love stream & table theory Session

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing, or different? And how does all of this relate to the programmatic frameworks we’re all familiar with? Learn the answers to these questions and more as we explore key concepts underpinning data processing in general.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting A Data Platform Tutorial

What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop, Spark and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Managing Data Science in the Enterprise Tutorial

In this tutorial, we will share our methods and observations from three years of effectively deploying data science in enterprise organizations. Attendees will learn how to build, run, and get the most value from data science teams, and how to work with and plan for the needs of the business.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time data engineering in the cloud 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Real-time data engineering in the cloud (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

The Five Dysfunctions of a Data Engineering Team Session

If you’re creating a data engineering team, there are common mistakes and patterns that can lead the team to fail or to perform at a much lower level. Early project success is predicated on management making sure the team is ready and has all of the skills needed.

Assaf Araki is responsible for big data analytics pathfinding in a group within Intel Information Technology that delivers advanced analytics and big data solutions across Intel. He drives the overall work with academia and industry on big data analytics and merges new technologies into Intel Information Technology. Assaf has over 10 years of experience in data warehousing, decision support solutions, and applied analytics within Intel.

Presentations

Hardcore Data Science welcome HDS

Hardcore Data Science hosts, Ben Lorica and Assaf Araki, welcome you to the day-long tutorial.

André Araujo is a solutions architect with Cloudera. Previously, he was an Oracle database administrator. An experienced consultant with a deep understanding of the Hadoop stack and its components, André is skilled across the entire Hadoop ecosystem and has specialized in building high-performance, secure, robust, and scalable architectures to fit customers’ needs. André is a methodical and keen troubleshooter and loves making things run faster.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Eduardo Arino de la Rubia is chief data scientist at Domino Data Lab. Eduardo is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. He is a graduate of the MTSU Computer Science department, General Assembly’s Data Science program, and the Johns Hopkins Coursera Data Science specialization. Eduardo is currently pursuing a master’s degree in negotiation, conflict resolution, and peacebuilding from CSUDH. You can follow him on Twitter as @earino.

Presentations

Leveraging open source automated data science tools Session

The promise of the automated statistician is as old as statistics itself. Eduardo Arino de la Rubia explores the tools created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation. Along the way, Eduardo compares open source tools such as TPOT and auto-sklearn and discusses their place in the DS workflow.

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has received a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

Using ML to solve failure problems with ML and AI apps in Spark Session

A roadblock in the agility that comes with Spark is that application developers can get stuck on application failures and have a tough time finding and resolving the issue. To address this roadblock, we have been working closely with Spark application developers to automatically identify and alleviate the root causes of application failures using ML techniques.

Infrastructure Product Lead at Spotify

Josh has spent the past four years enhancing Spotify’s data processing infrastructure. He’s helped expand Spotify’s Hadoop footprint from 180 machines in 2013 to over 2000 today and helped enable everyday real-time processing at Spotify.

Josh is currently working as the project lead for the data processing track of Spotify’s migration to Google Cloud Platform.

Along the way, he has spoken internationally about his experiences and about data infrastructure at Spotify.

Presentations

Spotify in the Cloud: the next evolution of Data @ Spotify Session

In early 2016, Spotify decided that we didn’t want to be in the datacenter business. The future was big and fluffy; the future was the cloud. In this talk, two leaders from Data Infrastructure at Spotify will walk through what it takes to move to the cloud. We’ll do an overview of technology choices and challenges in the cloud as well as some of the lessons our organization learned along the way.

Travis Bakeman is a senior manager of systems design and strategy at T-Mobile, where he has worked for the last 17 years, focusing on network performance management and big data analytics. Travis is responsible for multiple teams that deliver enterprise solutions leveraging off-the-shelf options such as Splunk and Oracle RAC as well as open source technologies like Cloudera Hadoop. He started in the telecom industry in 1999 in data center operations, after leaving a career in military intelligence in the United States Army. During his tenure with T-Mobile, he has acquired a broad range of experience, including operational support, database administration, data mediation, report development, data enrichment, and frontend application design. Today Travis’s main focus is building big data applications based on practical industry experience and operational efficiencies, at significantly reduced operating and capital cost.

Presentations

How T-Mobile Built a Massive-Scale Network Performance Management Platform on Hadoop Session

Learn how T-Mobile ported their large-scale network performance management platform (T-PIM) from a legacy database to a big data platform with Impala as the main reporting interface. This session will cover their migration journey, including the challenges they were facing, how they evaluated new technologies, lessons learned along the way, and the efficiencies gained as a result.

Michael Balint is a senior manager of applied solutions engineering at NVIDIA. Prior to working at NVIDIA, Michael was a White House Presidential Innovation Fellow, where he brought his technical expertise to projects like VP Biden’s Cancer Moonshot and Code.gov. A graduate of both Cornell and Johns Hopkins University, he has had the good fortune of applying software engineering and data science to many interesting problems throughout his career, including tailoring genetic algorithms to optimize air traffic, harnessing NLP to summarize product reviews, and automating the detection of melanoma via machine learning.

Presentations

Training a Deep Learning Risk Detection Platform Session

Learn how to bootstrap your own deep learning framework to detect risk and threats in production operational systems, using best-of-breed GPU-accelerated open source tools.

Kirit is currently director of product management at StreamSets.

Presentations

Real-Time Image Classification: Using Convolutional Neural Networks on Real-Time Streaming Data Session

Enterprises building data lakes often have to deal with very large volumes of image data collected over the years. In this session, we will talk about how some of the most sophisticated big data deployments use convolutional neural networks to automatically classify images, adding rich context about the content of each image in real time while ingesting data at scale.

Dr. Dominikus Baur works to make data accessible in every situation. As a data visualization and mobile interaction designer and developer, he creates usable, aesthetic, and responsive visualizations for desktops, tablets, and smartphones.

Dominikus holds a PhD in media informatics from the University of Munich (Ludwig-Maximilians-Universität). His research focused on making our growing personal databases of media, status updates, and messages manageable by everyone. Such personal visualizations have to work across devices and be as appealing and easy to use as possible.

As a freelancer, Dominikus has helped create beautiful visualizations for clients such as the OECD, Microsoft Research, and Wincor-Nixdorf. With a focus on web-based development and casual elegance, the results work everywhere. As a trainer for data visualization development, he holds workshops providing both a scientific and a practical background. He is a regular speaker at various academic and industry conferences.

Presentations

Data Futures - Exploring the future everyday implications of increasing access to our personal data Session

Increasing access to our personal data raises profound moral and ethical questions. ‘Data Futures’ will highlight the findings of an MFA class in which students observe each other through their own data. It will relate the experiences of the class directly to the conference through a live experiment with the audience that showcases some of the effects of our personal data becoming accessible.

Tim Berglund is a teacher, author, and technology leader with DataStax. He has spoken at numerous conferences internationally and in the United States and contributes to the Denver tech community as president of the Denver Open Source User Group. He is the copresenter of various O’Reilly training videos on topics ranging from Git to Mac OS X productivity tips to Apache Cassandra and is the author of Gradle Beyond the Basics. Tim blogs very occasionally at Timberglund.com. He lives in Littleton, Colorado, with the wife of his youth and their three children.

Presentations

Heraclitus, Enterprise Architecture, and Streaming Data Session

The Greek philosopher Heraclitus famously said, “You never step into the same river twice.” Almost as famous as Heraclitus is Apache Kafka, the de facto standard open source distributed stream processing system. In this talk, I’ll present several real-world systems built on Kafka, not just as a giant message queue but as a platform for distributed stream computation.

Ron Bodkin is CTO of architecture and services at Teradata, where he leads the global emerging technology team focusing on artificial intelligence, GPUs, and blockchain. He is also responsible for leading global consulting teams building enterprise analytics architectures that combine Hadoop and Spark, the public cloud, and traditional data warehousing, a strategic pillar for Teradata.

Previously, Ron was the founding CEO of Think Big Analytics, the leading global pure-play big data services firm, which was acquired by Teradata in 2014. Think Big provides end-to-end support for enterprise big data, including data science, data engineering, advisory and managed services, and frameworks such as Kylo for enterprise data lakes.

Before that, Ron was VP of engineering at Quantcast, where he led the data science and engineering teams that pioneered the use of Hadoop and NoSQL for batch and real-time decision making. Prior to that, Ron was the founder of New Aspects, which provided enterprise consulting for aspect-oriented programming, and cofounder and CTO of B2B applications provider C-Bridge, which he grew to a team of 900 people and led to a successful IPO. Ron graduated with honors from McGill University with a BS in math and computer science and earned his master’s degree in computer science from MIT, leaving the PhD program after presenting the idea for C-Bridge and placing in the finals of the 50K Entrepreneurship Contest.

Presentations

Fighting financial fraud at Danske Bank with artificial intelligence Session

Fraud in banking is an arms race, with criminals using machine learning to improve their attack effectiveness. Danske Bank is fighting back with deep learning. Learn how the leader in mobile payments in Denmark has implemented boosted decision trees in Spark and deep neural nets in TensorFlow. Hear operational considerations in training and deploying models, and lessons learned.

Training Recommendation Models Tutorial

Learn to apply deep learning to improve consumer recommendations. We train neural nets to learn categories of interest for recommendations (e.g., for cold start) using embeddings. Learn how to extend this with WALS matrix factorization to achieve Wide & Deep learning, which is now used in production for the Google Play store. Learn with TensorFlow on our cloud GPUs (or bring your own GPU laptop).
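
The Wide & Deep pattern the abstract mentions is available in TensorFlow as a canned estimator; here is a minimal sketch with hypothetical feature names (the tutorial’s own setup may differ):

import tensorflow as tf

# Sparse inputs: a categorical user feature and a hashed item id
country = tf.feature_column.categorical_column_with_vocabulary_list(
    "country", ["US", "DK", "JP"])
item_id = tf.feature_column.categorical_column_with_hash_bucket(
    "item_id", hash_bucket_size=10000)

# Wide part memorizes feature crosses; deep part generalizes via embeddings
wide_columns = [tf.feature_column.crossed_column(
    ["country", "item_id"], hash_bucket_size=10000)]
deep_columns = [
    tf.feature_column.embedding_column(item_id, dimension=32),
    tf.feature_column.indicator_column(country),
]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[128, 64])

# model.train(input_fn=...)  # input_fn yields ({"country": ..., "item_id": ...}, click_labels)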

Charles is the Chief Innovation Officer for Clearsense, a healthcare analytics organization specializing in bringing Big Data technologies to healthcare. Prior to Clearsense, Charles was the Enterprise Analytics Architect for Stony Brook Medicine. He has developed the analytics infrastructure to serve the clinical, operational, quality, and research needs of the organization. He was a founding member of the team that developed the Health and Human Services award-winning application “NowTrending” to assist in the early detection of disease outbreaks utilizing social media feeds. Charles is the Immediate Past President of the American Nursing Informatics Association.

Presentations

Spark Clinical Surveillance: Saving Lives and Improving Patient Care Session

Clearsense uses Spark Streaming to provide real-time updates to healthcare providers for critical healthcare needs. Clinicians make timely decisions from the assessment of a patient's risk for Code Blue, Sepsis, and other conditions based on information gathered from streaming physiological monitoring along with streaming diagnostic data and the patient historical record.

Matt Bolte is a technical expert at Wal-Mart Stores, Inc. He has 19 years of IT experience, including 5 years with large, secure enterprise Hadoop clusters, and has supported the Cloudera, Pivotal, and Hortonworks distributions.

Presentations

An Authenticated Journey Through Big Data Security at Wal-Mart Session

In today’s world of data breaches and hackers, security is one of the most important components of a big data system, but unfortunately it is usually the area least planned and architected. We will walk through one large company’s journey with authentication and give examples of how decisions made early can have significant impact throughout the maturation of your big data environment.

Tobi Bosede is a machine learning engineer at Capital One. She has also taught R programming at Johns Hopkins University and Python programming for General Assembly. Tobi’s professional work spans multiple industries, from telecom at Sprint to finance at J.P. Morgan. She holds a bachelor’s degree in mathematics from the University of Pennsylvania and a master’s in applied mathematics and statistics from Johns Hopkins University.

Presentations

Big Data Analysis of Futures Trades Session

Whether an entity seeks to create trading algorithms or mitigate risk, predicting trade volume is an important task. This talk focuses on futures trading and relies on Apache Spark for processing the large amount of data. We will consider the use of penalized regression splines for trade volume prediction and the relationship between price volatility and trade volume.
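
For readers unfamiliar with penalized regression splines: one common formulation expands the predictor in a B-spline basis and shrinks the coefficients with a ridge penalty. A small single-machine sketch on synthetic data (the talk’s Spark-based pipeline is not shown here):

import numpy as np
from patsy import dmatrix
from sklearn.linear_model import RidgeCV

# Synthetic stand-ins for the real series: price volatility -> trade volume
rng = np.random.RandomState(0)
volatility = rng.uniform(0.01, 1.0, size=1000)
volume = 2.0 * np.sqrt(volatility) + rng.normal(scale=0.1, size=1000)

# B-spline basis expansion; the ridge penalty on its coefficients yields a penalized spline fit
X = dmatrix("bs(x, df=20, degree=3) - 1", {"x": volatility}, return_type="dataframe")
model = RidgeCV(alphas=np.logspace(-4, 2, 20)).fit(X, volume)
print("R^2:", model.score(X, volume))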

David Boyle leads the work of the Insight team at BBC Worldwide, the commercial and global wing of the BBC, where he helps to transform the relationship that BBC Worldwide has with its audience by building premium, industry-leading insight capabilities into consumers, BBC brands, and the market, as well as what connects with audiences emotionally and inspires them. David has spent the last seven years constructing global insight capabilities for the publishing and music industries, which were widely acknowledged as having helped them make quicker, smarter, and bolder decisions for their brands. He joined BBC from HarperCollins Publishers, where as SVP of consumer insight he helped the company better understand consumer behavior and attitudes toward books, authors, book discovery, and purchase. Prior to that he was at EMI Music, where he delivered insight to all parts of the business in more than 25 countries and helped to shift the organization’s decision making at all levels, from artist signing to product and brand development plans for EMI’s biggest artists, including the Beatles and Pink Floyd.

Presentations

From the weeds to the stars: how and why to think about bigger problems Session

Too many brilliant analytical minds are wasted on interesting but ultimately less impactful problems, stuck in the weeds of the data and the challenges of our day-to-day. Too few ask what it means to reach for the stars: the big, shiny, business-changing issues. Come get fired up about asking bigger questions and making a bigger difference, with examples from experience in politics, music, and TV.

Matt Brandwein is director of product management at Cloudera, driving the platform’s experience for data scientists and data engineers. Before that, Matt led Cloudera’s product marketing team, with roles spanning product, solution, and partner marketing. Previously, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in computer science and mathematics from the University of Massachusetts Amherst.

Presentations

Data Science at Team Scale: Considerations for sharing, collaboration, and getting to production Session

Data science alone is easy. Data science with others, in the enterprise, on shared distributed systems, requires a bit more work. This talk will discuss common technology considerations and patterns for collaboration in large teams, as well as moving machine learning into production at scale.

Claudiu Branzan is the director of data science at G2 Web Services where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine-learning and distributed-systems experience. Previously, Claudiu worked for Atigeo Inc, building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with spaCy, Spark ML & TensorFlow Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, TensorFlow for training custom machine learned annotators, and Spark ML & TensorFlow for using deep learning to build & apply word embeddings.
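
A minimal sketch of the spaCy side of such an annotation pipeline, assuming the English model has been downloaded (the tutorial’s full Spark ML and TensorFlow integration is beyond a few lines):

import spacy

nlp = spacy.load("en")  # assumes the English model is installed

def annotate(texts):
    # nlp.pipe streams documents through the tokenizer/tagger/parser/NER pipeline in batches
    for doc in nlp.pipe(texts, batch_size=1000):
        yield [(ent.text, ent.label_) for ent in doc.ents]

for ents in annotate(["Apple is opening an office in Brooklyn."]):
    print(ents)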

Richard Brath has been designing and building innovative information visualizations for 20 years, ranging from one of the first interactive 3D financial visualizations on the web in 1995, to visualizations embedded in financial data systems used every day by thousands of market professionals. Richard is pursuing a PhD in new text visualization techniques at LSBU.

Presentations

Text Analytics and New Visualization Techniques Session

Text analytics is advancing rapidly, and new visualization techniques for text are providing new capabilities. We’re inventing new ways to organize massive volumes of text, characterize their subjects, score synopses, and skim through large numbers of documents.

Mikio Braun is delivery lead for recommendation and search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in industry.

Presentations

Deep learning in practice Session

Deep learning has become the go-to solution for many application areas, such as image classification or speech processing, but does it work for all application areas? Mikio Braun offers background on deep learning and shares his practical experience working with these exciting technologies.

Tamara Broderick is the ITT Career Development Assistant Professor in the Department of Electrical Engineering and Computer Science at MIT. She is a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), the MIT Statistics and Data Science Center, and the Institute for Data, Systems, and Society (IDSS). She completed her Ph.D. in Statistics with Professor Michael I. Jordan at the University of California, Berkeley in 2014. Previously, she received an AB in Mathematics from Princeton University (2007), a Master of Advanced Study for completion of Part III of the Mathematical Tripos from the University of Cambridge (2008), an MPhil by research in Physics from the University of Cambridge (2009), and an MS in Computer Science from the University of California, Berkeley (2013). Her recent research has focused on developing and analyzing models for scalable Bayesian machine learning, especially Bayesian nonparametrics. She has been awarded a Google Faculty Research Award, the ISBA Lifetime Members Junior Researcher Award, the Savage Award (for an outstanding doctoral dissertation in Bayesian theory and methods), the Evelyn Fix Memorial Medal and Citation (for the Ph.D. student on the Berkeley campus showing the greatest promise in statistical research), the Berkeley Fellowship, an NSF Graduate Research Fellowship, a Marshall Scholarship, and the Phi Beta Kappa Prize (for the graduating Princeton senior with the highest academic average).

Presentations

Kalah Brown is a senior Hadoop engineer at Big Fish Games, a leading producer and distributor of casual and mid-core games. She is responsible for technical leadership and the development of big data solutions.

Prior to joining Big Fish, she was a consultant in the greater Seattle area and worked with numerous companies including Disney, Starbucks, the Bill and Melinda Gates Foundation, Microsoft and Premera Blue Cross. She has 17 years of experience in software development, data warehousing, and business intelligence.

Presentations

Working within the Hadoop Ecosystem to Build a Live Streaming Data Pipeline Session

A growing number of companies are interested in processing and analyzing live streaming data. While the Hadoop ecosystem includes platforms and software library frameworks to support this work, these components require the correct architecture, performance tuning and customization. We will share our experience working with Spark, Flume, and Kafka to build a live streaming data pipeline.

Kurt leads the Data Platform team at Netflix. His group architects and manages the technical infrastructure underpinning the company’s analytics. The Netflix data platform includes various big data technologies (e.g. Spark, Hadoop, and Presto), Netflix open sourced applications and services (e.g. Genie and Lipstick), and traditional BI tools (e.g. Tableau and Redshift).

Presentations

20 principles & practices (Netflix-style!) to get the most out of your data platform Session

How can you get the most out of your data infrastructure? Come and find out what we do at Netflix and why. We'll run through 20 principles & practices that we've refined and embraced over time. For each one, we'll weave in how they interplay with the technologies we use at Netflix (e.g. S3, Spark, Presto, Druid, R, Python, Jupyter,...).

Marc Carlson is a lead computational biologist in research informatics at Seattle Children’s Research Institute (SCRI). Marc divides his time between helping to architect new cloud-based infrastructure to serve the scientists at SCRI, working to make sure that new compute resources are brought online and properly configured for immediate utility, and helping users with their data and analysis needs via the Bioinformatics Unit, whose goal is to make sure that scientists at SCRI can learn the most from their data. Marc’s Bioinformatics Unit contributions include creating and running training courses, periodic consultations, and helping with the Bioinformatics User Group. Marc has been part of Seattle Children’s since September 2015. Prior to joining SCRI, Marc received his BS in genetics and cell biology from Washington State University, followed by a PhD in developmental and cell biology from the University of California, Irvine. He subsequently did postdoctoral work in computational biology at UCLA before joining the Bioconductor core team at the Fred Hutchinson Cancer Research Center in 2007, where he served the needs of the R-based computational biology community for 8 years.

Presentations

Project Rainier: Saving Lives One Insight at a Time Session

Leveraging the power of the Hadoop Distributed File System and the Hadoop and Spark ecosystems, the scientists at Seattle Children’s Research Institute are able to quickly find new patterns and generate predictions that they can test later. The ultimate goal of Project Rainier is to accelerate important pediatric research and to increase scientific collaboration by highlighting where it is needed.

Michelle Casbon is director of data science at Qordoba. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on textual datasets. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her writing has been featured in the AI section of O’Reilly Radar. Michelle holds a master’s degree from the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation.

Presentations

How machine learning with open source tools helps everyone build better products Session

This talk explains the machine learning and natural language processing that enables teams to build products that feel native to every user. It explores how the underserved domain of localization is tackled using primarily open-source tools, including Kubernetes, Docker, Scala, Apache Spark, Apache Cassandra, and Apache PredictionIO (incubating).

Tanya Cashorali is the founding partner of TCB Analytics, a Boston-based data consultancy. Prior to launching TCB Analytics, she worked as a data scientist at Biogen. Tanya started her career in bioinformatics and has applied her experience to other data-rich verticals such as telecom, finance, and sports. She brings over 10 years of experience using R in data scientist roles as well as managing and training data analysts, and she’s helped grow a handful of Boston startups.

Presentations

How to hire and test for data skills: A one-size-fits-all interview kit Session

Given the recent demand for data analytics and data science skills, adequately testing and qualifying candidates can be a daunting task. Interviewing hundreds of individuals of varying experience and skill levels requires a standardized approach. Tanya Cashorali explores strategies, best practices, and deceptively simple interviewing techniques for data analytics and data science candidates.

Simon Chan is a Senior Director of Product Management for Salesforce Einstein where he oversees platform development and delivers products that empower anyone to build smarter apps with Salesforce. Simon is a product innovator and serial entrepreneur with more than 14 years of global technology management experience, working in London, Hong Kong, Guangzhou, Beijing and the Bay Area. Prior to its acquisition by Salesforce in February 2016, Simon was the co-founder and CEO of PredictionIO, a leading open source machine learning server. Following a successful launch in 2012, PredictionIO was named the most popular Spark-based machine learning project on Github. Simon holds a BSE in Computer Science from University of Michigan, Ann Arbor and a PhD in Machine Learning from University College London.

Presentations

The journey to Einstein: Building a multi-tenancy AI platform that powers hundreds of thousands of businesses Session

Salesforce announced “Salesforce Einstein” last year, bringing AI right into its core platform to power every business. The secret sauce behind Einstein is the underlying platform that accelerates AI development at scale, for both internal and external data scientists. Simon Chan shares his experience building a unified platform for a multi-tenancy, multi-business cloud enterprise.

Fallon Chen is a data engineer at Spotify.
Email: fallon@spotify.com

Presentations

Managing core data entities for internal customers at Spotify Session

At Spotify, we make data-driven product decisions. As we grow as a company, the magnitude and complexity of the data we care most about are growing at a rapid pace. During this 40-minute presentation, we will walk you through how we store and expose audience data created by multiple internal producers to consumers within Spotify.

Karim Chine is a London-based software architect and entrepreneur. After graduating from Ecole Polytechnique and Telecom ParisTech, he has held positions within academic research laboratories and industrial R&D departments including Imperial College London, EBI, IBM and Schlumberger. Karim’s interests include large scale distributed software design, cloud computing’s applications in research and education, open-source software ecosystems and open science. Since 2009, he has been collaborating with the European Commission as an independent expert for the research e-infrastructure program and for the future and emerging technologies program. He has been an evaluator and a reviewer of many of EU’s flagship projects related to grids, desktop grids, scientific clouds and science gateways. Karim is the author and designer of RosettaHUB.

Presentations

RosettaHUB, towards a global hub for reproducible and collaborative data science Session

RosettaHUB aims at establishing a global open data science meta cloud centered on usability, reproducibility, auditability, and shareability. It enables a wide range of social interactions and real-time collaborations. It leverages clouds and containers and makes them easy to use through converged web consoles, APIs (700+ functions), hybrid R/Python/Scala kernels, a workbench and widgets.

Michael Chui is a Partner in the McKinsey Global Institute. He is based in San Francisco, CA, where he directs research on the impact of disruptive technologies, such as Big Data, social media, and the Internet of Things, on business and the economy. As a McKinsey consultant, Michael served clients in the high-tech, media, and telecom industries on multiple topics. Michael is a frequent speaker at major global conferences and his research has been cited in leading publications around the world.

Michael holds a B.S. in Symbolic Systems from Stanford University and earned a Ph.D. in Computer Science and Cognitive Science, and a M.S. in Computer Science, from Indiana University.

Prior to joining McKinsey, Michael served as the first Chief Information Officer of the City of Bloomington, Indiana, and was the founder and executive director of HoosierNet, a regional Internet service provider.

Presentations

Executive Briefing: Artificial Intelligence Session

This Executive Briefing is a part of the Strata Business Summit. Details to come.

Eric Colson is chief algorithms officer at Stitch Fix as well as an advisor to several big data startups. Previously, Eric was vice president of data science and engineering at Netflix. He holds a BA in economics from SFSU, an MS in information systems from GGU, and an MS in management science and engineering from Stanford.

Presentations

Differentiating by Data Science Session

Many companies use data science as a *supportive* function for various business initiatives. However, the emergence of new business models has made it possible for some companies to *differentiate* via data science. When this is the case the company needs to think very differently about the role and placement of data science in the organization. Spoiler alert: it needs to report into the CEO!

Riccardo Gianpaolo Corbella is a big data engineer based in Milan and currently works as a consultant at Data Reply IT, where he develops effective big data solutions based on open source technologies. He received a BSc and an MSc in computer science from the Università degli Studi di Milano, where he developed an interest in data mining and distributed systems. He joins these two topics at Reply, helping some of the biggest players across a broad set of industries.

Presentations

How an Italian company rules the world of insurance facing new technological challenges to turn data into value Session

Italy holds an undisputed record in the world of car insurance: with more than 4.5 million black boxes, it is the country with the largest number of telematics clients in the world. Behind this explosion is a great investment in big data technologies and new architectures. We will highlight the application architecture and provide a real-time data management model to handle this scenario.

Dustin Cote is a customer operations engineer at Confluent. Over his career, Dustin has worked in a variety of roles from Java developer to operations engineer. His most recent focus is distributed systems in the big data ecosystem, with Apache Kafka being his software of choice.

Presentations

Mistakes were made, but not by us: Lessons from a year of supporting Apache Kafka Session

Operational war stories and lessons learned from the last year of supporting Apache Kafka, presented from the perspective of an enterprise support team. We share our experience supporting Apache Kafka at enterprise scale and explore monitoring and troubleshooting techniques to help you avoid pitfalls when scaling large Kafka deployments.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata + Hadoop World conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Michael Crutcher is responsible for the direction of Cloudera’s storage products. These include HDFS, HBase, Parquet, Kudu, and several others. He’s also responsible for managing strategic partnerships with storage vendors.

Presentations

The Sunset of Lambda: New Architectures Amplify IoT Impact Session

A long time ago in a datacenter far, far away... we deployed complex lambda architectures as the backbone of our IoT solutions. Though hard, they enabled collection of real-time sensor data and slightly delayed analytics. Today, the architecture for IoT data has been simplified by Apache Kudu, a relational storage layer for fast analytics on fast data - the key to unlocking the value in IoT data.

Nick Curcuru is vice president of the Enterprise Information Management practice at Mastercard Advisors. He brings over 20 years of global experience successfully delivering large-scale advanced analytics initiatives for companies such as the Walt Disney Company, Capital One, Home Depot, Burlington Northern Railroad, Merrill Lynch, Nordea Bank, and the General Electric Company.

His team works with organizations to generate revenue through smart data, architect next-generation technology platforms, and protect data assets from cyberattacks. The team leverages Mastercard’s own information technology and information security resources, creating “peer to peer” collaboration with its clients.

Nick frequently speaks on big data trends and data security strategies at conferences and symposiums. He has published several articles on security, revenue management, and data security, and has contributed to several books on the topic of data and analytics.

Presentations

Architecting Security across the Enterprise - Instilling Confidence and Stewardship Every Step of the Way Session

Cybersecurity is now a boardroom topic. Organizations are scrambling to increase their security posture. To decrease breach threats, Mastercard brings data security into its system design process. Listen as Mastercard shares its best practices, protecting 160 million transactions per hour over its network and securing 16+ petabytes of data at rest.

Paul is a Senior Solutions Engineer in Field Engineering at MapR, where he provides pre- and post-sales technical support to MapR’s worldwide Systems Engineering team. Prior to joining MapR, Paul served as Senior Operations Engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks, and advertisers. Previously, Paul was Systems Manager for Spiral Universe, a company providing school administration software as a service. He has also held senior support engineer positions at Sun Microsystems, as well as enterprise account technical management positions for both Netscape and FileNet. Earlier in his career, Paul worked in application development for Applix, IBM Service Bureau, and Ticketron. His background extends back to the ancient personal computing days, having started his first full-time programming job on the day the IBM PC was introduced.

Presentations

Why Containers and Microservices Need Streaming Data Session

A microservices architecture benefits from the agility of containers for convenient and predictable deployment of applications, while persistent and performant message streaming makes both work better. This talk explores these infrastructure components and design of highly scalable real world systems that take advantage of this powerful triad, including practical advice about what really works.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Brian is the head of data science at Zocdoc, an online doctor marketplace and booking tool, and is also an adjunct professor for NYU’s Center for Data Science graduate program. Prior to Zocdoc, Brian was VP of data science at Dstillery, an online advertising firm. Brian is a veteran data scientist and leader with over 15 years of experience developing machine learning-driven practices and products. He holds several patents and has published dozens of peer-reviewed articles on the subjects of causal inference, large-scale machine learning, and data science ethics. Brian is also the drummer for the critically acclaimed indie rock band Coastgaard.

Presentations

Challenges in Using Machine Learning to Direct Healthcare Services Session

Zocdoc is an online marketplace that allows easy doctor discovery and instant online booking. Being in the service of healthcare, however, brings many constraints and challenges that render standard approaches to common problems infeasible. This talk will survey the various machine learning problems we face and discuss the legal, data, and ethical constraints that shape our solution space.

Atul Dalmia is vice president of Global Information Management at American Express. He is responsible for leading the company’s data and platform strategy and driving innovation in acquisition, marketing, and servicing across the customer lifecycle and across channels. He is also responsible for accelerating development on AXP’s big data platform to drive innovation and speed to market while driving cost efficiencies for the enterprise. Over the last 3 years, Atul has been leading AXP’s transformation using big data to create value for its customers, businesses, and merchants. Atul holds a master’s degree from the Massachusetts Institute of Technology and a bachelor’s degree from the Indian Institute of Technology, Chennai.

Presentations

Enterprise Digital Transformation Using Big Data Session

Are you a large company that relies on data and analytics to power your business? Then it’s time to come into the digital age, and big data is the answer. The key to making such a large transformation is enterprise adoption across a variety of end users. Join American Express to learn best practices from their five-year journey, the biggest challenges you’ll face, and ideas on how to solve them.

Shirshanka Das is the architect for LinkedIn’s Analytics Platforms and Applications team. Shirshanka was one of the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, Apache Helix and Gobblin. His current focus at LinkedIn includes all things Hadoop, high-performance distributed OLAP engines, large-scale data ingestion, transformation and movement, and data lineage and discovery.

Presentations

Taming the ever-evolving Compliance Beast: Lessons learned at LinkedIn Session

We describe the journey of the big data ecosystem at LinkedIn in preserving member privacy while providing data democracy. We discuss three foundational building blocks for scalable data management that can meet data compliance regulations - a central metadata system, an integrated data movement platform and a unified data access layer.

Prior to joining Amplify as a general partner, Mike Dauber spent over six years at Battery Ventures, where he led early-stage enterprise investments on the West Coast, including Battery’s investment in a stealth security company that is also in Amplify’s portfolio. Most recently, Mike sat on the boards of Continuuity, Duetto, Interana, and Platfora. Mike previously invested in Splunk and RelateIQ, which was recently acquired by Salesforce. Mike began his career as a hardware engineer at a startup and later held product, business development, and sales roles at Altera and Xilinx. Mike is a frequent speaker at conferences and is on the advisory board of both the O’Reilly Strata Conference and SXSW. He was named to Forbes magazine’s 2015 Midas Brink List. Mike holds a BS in electrical engineering from the University of Michigan in Ann Arbor and an MBA from the University of Pennsylvania’s Wharton School.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Gerard de Melo is an Assistant Professor of Computer Science at Rutgers University, heading a team of researchers working on Big Data analytics, natural language processing, and web mining. Over the years, he has published over 80 papers on these topics, with Best Paper/Demo awards at WWW 2011, CIKM 2010, ICGL 2008, the NAACL 2015 Workshop on Vector Space Modeling, as well as an ACL 2014 Best Paper Honorable Mention, a Best Student Paper Award nomination at ESWC 2015, and a thesis award for his work on graph algorithms for knowledge modeling. Notable research projects include UWN/MENTA, one of the largest multilingual knowledge bases, and Lexvo.org, an important hub in the Web of Data.

Prior to joining Rutgers, he had been a faculty member at Tsinghua University, often considered China’s most prestigious university, where he headed the Web Mining and Language Technology group. Previously, he had been a Visiting Scholar at UC Berkeley, working in the ICSI AI group. He received his doctoral degree in computer science at the Max Planck Institute for Informatics. Gerard de Melo serves as an Editorial Board Member for Computational Intelligence, for the Journal of Web Semantics, for the Springer Language Resources and Evaluation journal, and for the Language Science Press TMNLP book series.

Presentations

Learning Meaning from Web-Scale Big Data HDS

How can we exploit the massive amounts of data now available on the Web to enable more intelligent applications? This talk presents results on learning more advanced representations of language and of world knowledge by applying Deep Learning techniques to Web-scale amounts of data. The resulting resources can be used in Spark, for instance, to work with text in over 300 languages.

Beniamino Del Pizzo is a big data engineer at Data Reply IT, a leading Italian big data consulting company, where he works on data ingest with a focus on Apache Kafka and Spark applications. Beniamino has a master’s degree in computer engineering, with a thesis on an evolutionary approach on Apache Spark to learn TSK-fuzzy systems for big data. He is passionate about big data, streaming applications, distributed computation, and data analysis.

Presentations

How an Italian company rules the world of insurance facing new technological challenges to turn data into value Session

Italy holds an undisputed record in the world of car insurance: with more than 4.5 million black boxes, it is the country with the largest number of telematics clients in the world. Behind this explosion is a great investment in big data technologies and new architectures. We will highlight the application architecture and provide a real-time data management model to handle this scenario.

Noemi Derzsy is currently a postdoctoral research associate at the Social Cognitive Networks Academic Research Center at Rensselaer Polytechnic Institute and a NASA Datanaut. With a PhD in physics and a research background in physics and computer science, she uses datasets to analyze, understand, and model complex systems using network science and data science techniques.

Presentations

Topic Modeling Open NASA Data Session

Open data has enabled society to engage in community-based research and has provided government agencies with more visibility and trust from individuals. I will briefly introduce the openNASA platform, with over 32,000 open NASA datasets, present an analysis of open NASA metadata, and show tools for applying NLP and topic modeling techniques to understand associations among open government datasets.
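
As a toy illustration of the topic modeling step (the speaker’s exact toolchain is not specified), gensim’s LDA can surface topics from dataset metadata descriptions; the two sample strings below are placeholders:

from gensim import corpora, models

descriptions = [
    "global temperature anomaly measurements from satellite instruments",
    "mars rover imagery and terrain elevation data",
]
tokenized = [d.lower().split() for d in descriptions]

# Map tokens to ids, then convert each description to a bag-of-words vector
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)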

Stephen Devine is a data engineer at Big Fish Games in Seattle, where he wrangles events sent from millions of mobile phones through Kafka into Hive. He previously did similar things for Xbox One Live services using proprietary Microsoft technology, and before that worked on several releases of Internet Explorer.

Presentations

Working within the Hadoop Ecosystem to Build a Live Streaming Data Pipeline Session

A growing number of companies are interested in processing and analyzing live streaming data. While the Hadoop ecosystem includes platforms and software library frameworks to support this work, these components require the correct architecture, performance tuning and customization. We will share our experience working with Spark, Flume, and Kafka to build a live streaming data pipeline.

Ewa is responsible for SQL workload optimization solutions, including traditional data warehouse workload offload and Impala/Hive workload optimization. She manages product direction and strategy for Navigator Optimizer (formerly known as Xplain.io). Prior to Cloudera, Ewa held leadership positions driving product strategy and product design for several enterprise SaaS applications, including Xplain.io.

Presentations

Optimizing the Data Warehouse at Visa Session

At Visa, the process of optimizing the enterprise data warehouse and consolidating data marts by migrating these analytic workloads to Hadoop has played a key role in the adoption of the platform and how data has transformed Visa as an organization. This talk will look at Visa’s experience with this process and provide some best practices for organizations migrating workloads to Hadoop.

Leo has a background in physics and started writing software professionally in the 1980s. In 2012 he became fascinated with deep learning and has been building systems with it ever since. He led the engineering team that launched the Amazon Machine Learning service in 2015. Now he works in the Amazon AI group at Amazon Web Services.

Presentations

Practical Deep Learning for Understanding Images Session

Learn how to apply the latest deep learning techniques to semantically understand images, without needing a PhD in machine learning. Learn what embeddings are, how to extract them from your images using deep convolutional neural networks (CNNs), and how they can be used to cluster and classify large datasets of images.
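
A sketch of the embedding extraction the abstract describes, assuming a Keras/ResNet50 setup (the session’s exact stack is not specified) and hypothetical image paths:

import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.preprocessing import image
from sklearn.cluster import KMeans

# Pretrained CNN with the classifier head removed; the pooled activations act as an embedding
model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]  # 2048-dimensional embedding vector

# Cluster the embeddings to group visually similar images
paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # hypothetical files
vectors = np.stack([embed(p) for p in paths])
print(KMeans(n_clusters=2).fit_predict(vectors))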

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

GDPR: Getting Your Data Ready for Heavy New EU Privacy Regulations Session

The General Data Protection Regulation (GDPR) will go into effect in May 2018 for firms doing any business in the EU. However, many companies aren’t prepared for the strict regulation or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify GDPR compliance, as well as future regulations.

Mike Driscoll founded Metamarkets in 2010 after spending more than a decade developing data analytics solutions for online retail, life sciences, digital media, insurance, and banking. Metamarkets provides an end-to-end analytics solution for leaders in programmatic marketing, including Twitter, LinkedIn, and AOL. Previously, Mike founded and sold two companies: Dataspora, a life science analytics company, and CustomInk, an early pioneer in customized apparel. He began his career as a software engineer for the Human Genome Project. Mike holds an A.B. in Government from Harvard and a Ph.D. in Bioinformatics from Boston University.

Presentations

The Cognitive Design Principles of Interactive Analytics Session

Most analytics tools in use today provide static visuals that don’t reveal the full, real-time picture. In this session, Mike Driscoll will discuss how to take an interactive approach to analytics. From design techniques to discovering new forms of data exploration, he will reveal how to put the full power of big data into the hands of the people who need it to make key business decisions.

Founder and CEO at Estimize

Presentations

Crowdsourced Alpha: The Future of Investment Research FinData

Mathieu Dumoulin is a data scientist in MapR Technologies’s Tokyo office, where he combines his passion for machine learning and big data with the Hadoop ecosystem. Mathieu started using Hadoop from the deep end, building a full unstructured data classification prototype for Fujitsu Canada’s Innovation Labs, a project that eventually earned him the 2013 Young Innovator award from the Natural Sciences and Engineering Research Council of Canada. Afterward, he moved to Tokyo with his family where he worked as a search engineer at a startup and a managing data scientist for a large Japanese HR company, before coming to MapR.

Presentations

State of the Art Robot Predictive Maintenance with Real-Time Sensor Data Session

See a working, practical predictive maintenance pipeline in action. We'll show how we built a state-of-the-art anomaly detection system using well-known, standard big data frameworks like Spark, H2O, TensorFlow, and Kafka on the MapR Converged Data Platform. Building on our Strata Beijing presentation, we'll show an improved deep learning-based model with significantly better performance.

Ted Dunning has been involved with a number of startups—the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member for the Apache ZooKeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data mining and passionate about open source, he is an active participant in Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

Tensor Abuse in the Workplace Session

I will show, in practical terms, both how tensor computing works and how it can be put to good use in a variety of settings beyond the most common use of training deep neural networks. Attendees will learn the high-level principles behind these systems and will leave with an ability to understand what is really going on, along with some practical open-source examples.

Mateusz is a Tokyo-based software engineer at H2O.ai, the company behind H2O, the leading open source machine learning platform for smarter applications and data products. He works on distributed machine learning projects, including the core H2O platform and Sparkling Water, which integrates H2O and Apache Spark. Previously, he worked at Fujitsu Laboratories on natural language processing and the use of machine learning techniques for investments. After Fujitsu, he moved to Infoscience to work on a highly distributed log data collection and analysis platform.

Mateusz loves all things distributed and machine learning, and hates buzzwords. In his spare time he participates in the IT community by organizing, attending, and speaking at conferences and meetups. Mateusz holds an MSc in computer science from AGH UST in Krakow.

Presentations

State of the Art Robot Predictive Maintenance with Real-Time Sensor Data Session

See a working, practical predictive maintenance pipeline in action. We'll show how we built a state-of-the-art anomaly detection system using well-known, standard big data frameworks like Spark, H2O, TensorFlow, and Kafka on the MapR Converged Data Platform. Building on our Strata Beijing presentation, we'll show an improved deep learning-based model with significantly better performance.

Barbara Eckman is a Principal Data Architect at Comcast. She leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing Big Data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project Center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.

Presentations

End-to-end Data Discovery and Lineage in a Heterogeneous Big Data Environment with Apache Atlas and Avro Session

Comcast’s Streaming Data platform comprises a variety of ingest, transformation, and storage services. Peer-reviewed Apache Avro schemas support end-to-end data governance. We have extended Apache Atlas with custom data and process types for data discovery and lineage. Custom asynchronous messaging libraries notify Atlas of new data and schema entities and lineage links as they are created.

Bob Eilbacher is an experienced operations and client services professional with a successful track record of providing technology solutions and services that focus on uncovering analytics insights and driving efficiency across an enterprise. He works directly with clients to develop strategies and implement solutions that transform structured and unstructured data into analytics-driven business insights. He has a strong background in technology and a deep appreciation for finding the right solution. Before joining the Caserta team in 2013, he served in executive roles at Verint and Ness Technologies.

Presentations

Creating a DevOps Practice for Analytics Session

Building an efficient analytics environment requires a strong infrastructure. In this presentation, Bob Eilbacher, VP of Operations at Caserta, discusses how to implement a strong DevOps practice for data analysis, starting with the necessary cultural changes that must be made at the executive level and ending with an overview of potential DevOps toolchains.

Amie Elcan has worked in the telecommunications industry for over 20 years, delivering traffic-based assessments that drive optimal network architecture and engineering design decisions. She is currently a Principal Architect at CenturyLink in the Product Development and Technology organization. Her current areas of focus at CenturyLink are traffic modeling, network analytics, and data science.

Presentations

Classification of Telecom Network Traffic: Insight Gained Using Statistical Learning on a Big Data Platform DCS

The predominant applications on the Internet continually evolve, creating dynamic and unpredictable traffic patterns on telecom networks. For such complex systems, it is essential to apply analytics and statistical learning techniques on network device measurements at scale to understand, classify and predict traffic patterns.

Since joining DHL Supply Chain in 2012, Xavi has served in different roles driving the standardization and innovation agenda in Europe, including DHL's Vision Picking, Robotics, and Internet of Things initiatives. Since early 2017, Xavi has also taken over responsibility for the DHL IT solutions that are proposed to DHL customers in EMEA. Xavi has an MSc in Computer Engineering from the Universitat Politècnica de Catalunya in Barcelona and is currently based at DHL Headquarters in Bonn, Germany.

Presentations

Seeing everything so managers can act on anything: IoT in DHL Supply Chain operations Data 101

DHL has created an IoT initiative for its supply chain warehouse operations. Utilizing immersive operational data visualization, DHL is gaining unprecedented insight, from the most comprehensive global view across all locations down to a unique data feed from a single sensor, allowing it to see, understand, and act on everything that occurs in its warehouses.

Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Prior to Alluxio, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. Bin has a PhD in computer science from Carnegie Mellon University.

Presentations

Best Practices for using Alluxio with Spark Session

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory to manage data across different storage systems. Many deployments pair Alluxio with Spark because Alluxio helps Spark be more effective and further accelerates applications. We discuss how Alluxio helps Spark and describe production deployments of Alluxio and Spark working together.
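
As a rough illustration of the integration, here is a minimal PySpark sketch, assuming an Alluxio master reachable at alluxio-master:19998 and the Alluxio client jar already on Spark's classpath (both assumptions, not details from the session):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-example").getOrCreate()

    # Writing through an alluxio:// URI stores the data in Alluxio, which
    # can serve subsequent reads from memory rather than the under-store.
    df = spark.range(1000)
    df.write.parquet("alluxio://alluxio-master:19998/data/example.parquet")

    # Re-reading the same path is served at memory speed by Alluxio.
    cached = spark.read.parquet("alluxio://alluxio-master:19998/data/example.parquet")
    print(cached.count())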

Avrilia Floratau is a senior scientist at Microsoft’s Cloud and Information Services Lab, where her research is focused on scalable real-time stream processing systems. She is also an active contributor to Heron, collaborating with Twitter. Previously, Avrilia was a research scientist at IBM Research working on SQL-on-Hadoop systems. She holds a PhD in data management from the University of Wisconsin-Madison.

Presentations

Modern Real Time Streaming Architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This, in part, stems from the deluge of high velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through the state-of-the-art streaming systems, algorithms and deployment architectures.

Eugene Fratkin is a director of engineering at Cloudera, leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University's AI lab.

Presentations

A Deep Dive into Running Data Engineering Workloads in AWS Tutorial

Data engineering workloads are foundational workloads run prior to most analytic and operational database use cases. This hands-on tutorial will provide a deep dive into running data engineering workloads in a managed service capacity in the public cloud; highlight AWS infrastructure best practices; and discuss how data engineering workloads interoperate with data analytic workloads.

Michael J. Freedman is a Professor in the Computer Science Department at Princeton University, as well as the co-founder and CTO of Timescale, building an open-source database that scales out SQL for time-series data. His work broadly focuses on distributed systems, networking, and security.

He developed and operated several self-managing systems — including CoralCDN, a decentralized content distribution network, and DONAR, a server resolution system that powered the FCC’s Consumer Broadband Test — which reached millions of users daily. Freedman’s work on IP geolocation and intelligence led him to co-found Illuminics Systems, which was acquired by Quova (now part of Neustar) in 2006. His work on programmable enterprise networking (Ethane) helped form the basis for the OpenFlow / software-defined networking (SDN) architecture. Freedman is also a technical advisor to Blockstack, building decentralized services leveraging the blockchain.

Honors include a Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), Sloan Fellowship, NSF CAREER Award, Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award publications. He received his Ph.D. in computer science from NYU’s Courant Institute and his S.B. and M.Eng. degrees from MIT.

Presentations

When boring is awesome: Making PostgreSQL scale for time-series data Session

Time-series data is emerging everywhere, with very different workload characteristics than traditional transactional workloads. Michael Freedman outlines a new scale-out database designed for time-series workloads, yet open source and engineered as a plugin to PostgreSQL. Unlike most time-series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.
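
To make the "plugin to Postgres" point concrete, here is a minimal sketch using the psycopg2 Python driver against a PostgreSQL instance with the TimescaleDB extension available (the connection details and table schema are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=tsdb user=postgres host=localhost")
    cur = conn.cursor()

    # TimescaleDB loads like any other Postgres extension.
    cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;")
    cur.execute("""
        CREATE TABLE conditions (
            time        TIMESTAMPTZ NOT NULL,
            device_id   TEXT,
            temperature DOUBLE PRECISION);
    """)

    # create_hypertable() converts the table into a time-partitioned
    # hypertable that still speaks full SQL.
    cur.execute("SELECT create_hypertable('conditions', 'time');")
    cur.execute("INSERT INTO conditions VALUES (now(), %s, %s);",
                ("sensor-1", 21.5))
    cur.execute("SELECT device_id, avg(temperature) FROM conditions "
                "GROUP BY device_id;")
    print(cur.fetchall())
    conn.commit()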

Jon Fuller is an application scientist at KNIME.com, where he works with customers to deploy advanced analytics and helps them understand the power of working with cloud resources. Previously, Jon worked as a postdoctoral researcher at the Heidelberg Institute for Theoretical Studies, where he published several papers on computational biology topics. Jon holds a PhD in bioinformatics from the University of Leeds and is a lapsed physicist.

Presentations

Deploying Deep Learning to Assist the Digital Pathologist Session

KNIME, Apache Spark, and Microsoft Azure work together to enable fast, cheap, and automated classification of malignant lymphoma types in digital pathology images. The trained model is deployed to end users as a web application using the KNIME WebPortal.

Eddie Garcia is chief information security officer at Cloudera, a leader in enterprise analytic data management, where he draws on his more than 20 years of information and data security experience to help Cloudera enterprise customers reduce security and compliance risks associated with sensitive data sets stored and accessed in Apache Hadoop environments. Previously, Eddie was the vice president of infosec and engineering for Gazzang prior to its acquisition by Cloudera, where he architected and implemented secure and compliant big data infrastructures for customers in the financial services, healthcare, and public sector industries to meet PCI, HIPAA, FERPA, FISMA and EU data security requirements. He was also the chief architect of the Gazzang zNcrypt product and is author of three patents for data security.

Presentations

Machine learning to "spot" cybersecurity incidents at scale Session

Machine data from firewalls, network switches, DNS servers, and many other devices in your organization may hold untapped potential for cybersecurity threat analytics using machine learning. Attend this session to see how Apache Spot (incubating) may help your organization detect and prevent malicious behavior on your infrastructure.

Yael Garten leads a team of data scientists at LinkedIn that focuses on understanding and increasing growth and engagement of LinkedIn’s 460 million members across mobile and desktop consumer products. Yael is an expert at converting data into actionable product and business insights that impact strategy. Her team partners with product, engineering, design, and marketing to optimize the LinkedIn user experience, creating powerful data-driven products to help LinkedIn’s members be productive and successful. Yael champions data quality at LinkedIn; she has devised organizational best practices for data quality and developed internal data tools to democratize data within the company. Yael also advises companies on informatics methodologies to transform high-throughput data into insights and is a frequent conference speaker. She holds a PhD in biomedical informatics from the Stanford University School of Medicine, where her research focused on information extraction via natural language processing to understand how human genetic variations impact drug response, and an MSc from the Weizmann Institute of Science in Israel.

Presentations

The Unspoken Challenges of Doing Data Science Session

Data science is a rewarding career. It's also hard: not just the technical work itself but also "how to do the work well" in an organization. Garten explores what data scientists do, how they fit into the broader company organization, and how they can excel at their trade, sharing the hard and soft skills required, challenges to watch out for, and tips and tricks for success and #DataScienceHappiness.

Alison has worked at Spotify since 2014 and has coached and led teams in backend services and data infrastructure. Prior to Spotify, she led engineering teams at nonprofit organizations in education and corporate social responsibility.

Presentations

Spotify in the Cloud: the next evolution of Data @ Spotify Session

In early 2016, Spotify decided that we didn’t want to be in the datacenter business. The future was big and fluffy; the future was the cloud. In this talk, two leaders from Data Infrastructure at Spotify will walk through what it takes to move to the cloud. We’ll do an overview of technology choices and challenges in the cloud as well as some of the lessons our organization learned along the way.

Daniel Goddemeyer is the founder of OFFC NYC, a New York City-based research and design studio that works with global brands, research institutions, and startups to explore future product applications for today's emerging technologies.

With his own research and his MFA class at the School of Visual Arts he explores how the increasing proliferation of these technologies in our future lives will transform our everyday interactions.

For his work he has received several distinctions and awards from the Art Directors Club, the Red Dot Award, the German Design Prize, the Kantar Information Is Beautiful Awards, and the Industrial Designers Society of America.

His work has been exhibited internationally at the Westbound Shanghai Architecture Biennial, the Data in the 21st Century exhibition at V2 Rotterdam, Data Traces Riga, and the Big Bang Data exhibition at London's Somerset House, among others.

Presentations

Data Futures - Exploring the future everyday implications of increasing access to our personal data Session

Increasing access to our personal data raises profound moral and ethical questions. ‘Data Futures’ will highlight the findings of an MFA class in which students observe each other through their own data. It will relate the experiences of the class directly to the conference through a live experiment with the audience that showcases some of the effects of our personal data becoming accessible.

Brian is an Associate Professor of Physics and Data Science at Cal Poly State University in San Luis Obispo, CA, where he teaches data science. He is a leader of the IPython project, a co-founder of Project Jupyter, and an active contributor to a number of other open source projects focused on data science in Python. Recently, he co-created the Altair package for statistical visualization in Python. He is an advisory board member of the NumFOCUS Foundation and a faculty fellow of the Cal Poly Center for Innovation and Entrepreneurship.

Presentations

JupyterLab: Building blocks for interactive computing Session

JupyterLab is an extensible IDE-like web application for data science and computation, and is the next generation of the popular Jupyter Notebook. Users compute with multiple notebooks, editors, and consoles that work together in a tabbed layout. We demonstrate JupyterLab and show how users can use third-party plugins to extend and customize many aspects of JupyterLab for their workflow and data.

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop and a committer and PMC member on Apache Sentry and has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on big data topics at various national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Architecting a next generation data platform Tutorial

Using the Internet of Things and Customer 360 as an example, we’ll explain how to architect a modern, real-time big data platform leveraging recent advancements in open-source software. We’ll show how components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL along with Apache Hadoop can enable new forms of data processing and analytics.

Nadeem Gulzar is the Head of Advanced Analytics & Architecture at Danske Bank Group, a Nordic bank with strong roots in Denmark and a focus on becoming the most trusted financial partner in the Nordics. For the last two years, Nadeem has taken the lead in establishing advanced analytics and big data technologies within Danske. Previously, Nadeem worked with credit and market risk, where he headed the programme to build up capabilities to calculate risk using Monte Carlo simulation methods. Nadeem graduated from Copenhagen University with a BS in Computer Science, Mathematics, and Psychology and holds a Master's Degree in Computer Science from Copenhagen University.

Presentations

Fighting financial fraud at Danske Bank with artificial intelligence Session

Fraud in banking is an arms race, with criminals using machine learning to improve the effectiveness of their attacks. Danske Bank is fighting back with deep learning. Learn how the leader in mobile payments in Denmark has implemented boosted decision trees in Spark and deep neural nets in TensorFlow, hear operational considerations in training and deploying models, and learn lessons from the effort.
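
The session names boosted decision trees in Spark; a minimal PySpark sketch of that model family follows (the toy fraud features and labels here are invented for illustration, not Danske Bank's):

    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(120.0, 1, 0.0), (9800.0, 0, 1.0), (45.0, 1, 0.0), (7200.0, 0, 1.0)],
        ["amount", "known_device", "is_fraud"])

    # Spark ML expects the inputs assembled into a single feature vector.
    assembler = VectorAssembler(inputCols=["amount", "known_device"],
                                outputCol="features")
    train = assembler.transform(df)

    # Gradient-boosted trees: the "boosted decision trees" of the session.
    model = GBTClassifier(labelCol="is_fraud", maxIter=20).fit(train)
    model.transform(train).select("is_fraud", "prediction").show()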

Alexandra is a data scientist at Arundo Analytics with a background in mechanical engineering and applied numerical methods.

Presentations

IIoT Data Fusion: Bridging the gap from data to value Session

One of the main challenges when working with industrial data is not only the amount of data, but also the ability to link these data and extract value. We propose a comprehensive pre-processing methodology which structures and links data from different sources, thus converting the IIoT analytics process from an unorganized mammoth to one which is more likely to generate insight.

Sijie Guo is a co-founder of Streamlio, which focuses on building a next-generation real-time data stack. Before Streamlio, he was the tech lead for the messaging group at Twitter, where he co-created Apache DistributedLog. He is also the PMC chair of Apache BookKeeper. Prior to Twitter, he worked on push notification infrastructure at Yahoo.

Presentations

Messaging, Storage or Both: The Real Time Story of Pulsar and Apache DistributedLog Session

Modern enterprises produce data not only at volume but also at high velocity. To keep up with that velocity and process data in real time, new types of storage systems are being designed, implemented, and deployed. This talk focuses on Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.

Modern Real Time Streaming Architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This, in part, stems from the deluge of high velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through the state-of-the-art streaming systems, algorithms and deployment architectures.

Yufeng is a Developer Advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning; share your use case with him on Twitter at @YufengG.

Presentations

Getting started with TensorFlow Tutorial

We will walk you through training and deploying a machine-learning system using TensorFlow, a popular open source library. Starting from conceptual overviews, we will build all the way up to complex classifiers. You’ll gain insight into deep learning and how it can apply to complex problems in science and industry.
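
For a flavor of what the tutorial builds up from, here is a minimal sketch in the TensorFlow 1.x style of a tiny classifier, trained on a toy dataset invented for illustration:

    import numpy as np
    import tensorflow as tf

    # Toy data: label points by which side of the line x0 + x1 = 0 they fall on.
    X_train = np.random.randn(200, 2).astype(np.float32)
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(np.int64)

    x = tf.placeholder(tf.float32, [None, 2])
    y = tf.placeholder(tf.int64, [None])

    # A single linear layer; deeper classifiers stack more layers here.
    logits = tf.layers.dense(x, 2)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    accuracy = tf.reduce_mean(
        tf.cast(tf.equal(tf.argmax(logits, 1), y), tf.float32))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(500):
            sess.run(train_op, feed_dict={x: X_train, y: y_train})
        print(sess.run(accuracy, feed_dict={x: X_train, y: y_train}))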

Dr. Yunsong Guo is a staff engineer at Pinterest, where he has developed Homefeed ranking ML models for the past three years. He is a founding member of the Homefeed ranking team and has led key projects to move Pinterest Homefeed ranking from time-based ordering to logistic regression and later to GBDT-powered ranking systems. These projects and feature improvements resulted in more than 100% gains in Homefeed user engagement. Prior to Pinterest, Yunsong obtained his PhD in Computer Science from Cornell University, focusing on machine learning, and spent a few years working in London and Hong Kong on algorithmic trading, high-frequency trading, and statistical arbitrage using machine-learned models.

Presentations

Building for both Apples and Oranges - How Pinterest adapted their Machine Learning Frameworks, Features, and Training Systems to Amplify International Growth HDS

In the last two years, Pinterest transitioned from a world with a small minority of international users to having more than half of our traffic from outside of the US. As a result, Pinterest's historical users were no longer the same as the new ones, posing predictive challenges. This talk explores multiple strategies we used to prime our ML systems to drive this new international growth.

Sebastian Gutierrez is a Data Entrepreneur who focuses on data-driven companies.

In Data Visualization, Sebastian founded DashingD3js.com to provide online and corporate training in data visualization and D3.js to a diverse client base, including corporations like the New York Stock Exchange, the American Express Company, Intel, General Dynamics, Salesforce, Thomson Reuters, Oracle, Bloomberg Businessweek, universities, and dozens of startups. More than 1,000 people have attended his live trainings and many more have succeeded with his online D3.js training.

In Data Science, Sebastian co-founded DataScienceWeekly.org to provide news, analysis, and commentary in data science. The Data Science Weekly newsletter reaches tens of thousands of aspiring and professional data scientists on a weekly basis. Sebastian is also the author of “Data Scientists at Work”, a collection of interviews with many of the world’s most influential and interesting data scientists from across the spectrum of public companies, private companies, startups, venture investors, and nonprofits.

Sebastian Gutierrez holds a BS in Mathematics from MIT and an MA in Economics from the University of San Francisco.

Presentations

How You Can Use The Science of Human Perception To Improve Your Business Decision-Making Session

You use business metrics and analytics to achieve success in your data-driven organization. You visualize and communicate what data are saying in order to achieve individual, team, and organizational goals. This session will show you how to use the science of human perception to drastically improve your data visualizations, reports, and dashboards to drive better decisions and results.

Felix GV is a software engineer working on LinkedIn’s data infrastructure. He works on Voldemort and Venice, and keeps a close eye on Hadoop, Kafka, Samza, Azkaban and other systems.

Presentations

Introducing Venice: a Derived Datastore for Batch, Streaming and Lambda Architectures Session

As companies build batch and stream processing pipelines, the next stage in their evolution is to serve the insights they gleaned back to their users. This often-overlooked problem can be hard to achieve reliably and at scale. Venice is a new datastore capable of ingesting data from Hadoop and Kafka, merging it together, replicating it globally, and serving it online at low latency.

Patrick Hall is a senior data scientist and product engineer at H2O.ai, where he works with customers to derive substantive business value from machine learning technologies. His product work at H2O.ai focuses on two important aspects of applied machine learning: model interpretability and model deployment. Patrick is also an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning.

Prior to joining H2O.ai, Patrick held global customer-facing and R&D roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the 11th person worldwide to become a Cloudera certified data scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Interpretable AI is not just for regulators! Session

Interpreting machine learning models is not just another regulatory burden to be overcome. Practitioners, researchers, and consumers who use these technologies in their work and their day-to-day lives have the right to trust and understand AI. This talk is an overview of techniques for interpreting machine learning models and telling stories from their results.

Behrooz Hashemian, Ph.D., is a researcher and chief data officer at Massachusetts Institute of Technology (MIT), Senseable City Lab. He investigates the innovative implementation of big data analytics and artificial intelligence in smart cities, finance, and healthcare. He is a lead data scientist with expertise in developing predictive analytics strategies, machine learning solutions, and data-driven platforms for informed decision making. His work endeavors to bridge the gap between academic research and industrial deployment of big data analytics and artificial intelligence. Dr. Hashemian leads an unprecedented project on anonymized data fusion, which provides a multidimensional insight into urban activities and customer behaviors from multiple sources.

Presentations

Anonymized Data Fusion: Privacy vs Utility Session

Since the advent of pervasive digital technologies, people have been leaving an increasing amount of digital traces in their everyday lives. Because these traces are mostly anonymized, the information gained by advanced data analytics is confined to each individual trace. However, by taking advantage of patterns in people's behavior, we are able to fuse various traces and build a multidimensional insight.

Bill Havanki is a software engineer working for Cloudera, where he has contributed to Hadoop components as well as systems for deploying Hadoop clusters into public cloud services. Prior to joining Cloudera, he spent 15 years developing software for government contracts, focusing mostly on analytic frameworks and authentication and authorization systems. He earned his B.S. in Electrical Engineering from Rutgers University and his M.S. in Computer Engineering from North Carolina State University. A New Jersey native, he currently lives near Annapolis, Maryland, with his family.

Presentations

Automating Cloud Cluster Deployment: Beyond the Book Session

Speed and reliability in deploying big data clusters is key for effectiveness in the cloud. The new book _Moving Hadoop to the Cloud_ covers essential practices like baking images and automating cluster configuration. This session builds on those practices to describe how you can automate the creation of new clusters from scratch and use metrics gathered from the cloud provider to scale up.

Katherine is an Assistant Professor at Duke University in the Department of Statistical Science and at the Center for Cognitive Neuroscience, as well as the Departments of Computer Science and Electrical and Computer Engineering. Prior to joining Duke, she was an NSF Postdoctoral Fellow in the Computational Cognitive Science group at MIT and an EPSRC Postdoctoral Fellow at the University of Cambridge. Her PhD is from the Gatsby Unit at University College London.

Katherine’s research interests lie in the fields of machine learning and Bayesian statistics. Specifically, she develops new methods and models to discover latent structure in data, including cluster structure, using Bayesian nonparametrics, hierarchical Bayes, time series techniques, and other Bayesian statistical methods.
She applies these methods to problems in the brain and cognitive sciences, human social interactions, and clinical medicine. Katherine has been the fortunate recipient of a first round NSF BRAIN initiative award, a Google faculty research award, and an NSF CAREER award.

Presentations

Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic-net regularization in Spark’s ML library, one-pass elastic-net linear regression, and is leading development efforts for a new Spark ML optimization library. He has also made extensive contributions to Spark ML decision trees and ensemble algorithms. Prior to joining Cloudera, he worked on Spark MLlib as a machine learning engineer at IBM’s Spark Technology Center. He also earned his M.S. in electrical engineering from Georgia Institute of Technology.

Presentations

Boosting Spark MLlib Performance with Rich Optimization Algorithms Session

Recent developments in Spark MLlib have given users the power to express a wider class of ML models and decrease model training times via the use of custom parameter optimization algorithms. Seth Hendrickson and DB Tsai discuss when and how to use this new API, and walk you through creating your own Spark ML optimizer. They also show performance benefits and real-world use cases.

Extending Spark ML: Adding your own tools & algorithms Session

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines, and then looks at how to extend them with your own custom algorithms. Even if you don't have your own algorithm to add, this talk will give you a deeper understanding of Spark's ML pipelines.
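
To show the shape of such an extension, here is a minimal sketch of a custom PySpark Transformer (the stage name and behavior are hypothetical, chosen only to illustrate the API):

    from pyspark.ml import Transformer
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol
    from pyspark.sql import SparkSession, functions as F

    class ColumnScaler(Transformer, HasInputCol, HasOutputCol):
        """Multiplies inputCol by a fixed factor into outputCol."""

        def __init__(self, inputCol=None, outputCol=None, factor=2.0):
            super(ColumnScaler, self).__init__()
            self._set(inputCol=inputCol, outputCol=outputCol)
            self.factor = factor

        def _transform(self, dataset):
            # _transform is the one method a custom Transformer must supply.
            return dataset.withColumn(
                self.getOutputCol(), F.col(self.getInputCol()) * self.factor)

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,)], ["x"])
    ColumnScaler(inputCol="x", outputCol="x_scaled", factor=3.0).transform(df).show()

Because the stage subclasses Transformer, it can drop straight into a pyspark.ml Pipeline alongside the built-in stages.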

J.C. Herz is co-founder of Ion Channel, a data and micro-services platform that automates situational awareness and enables risk-management of the software supply chain. She has 15 years of analytics experience in healthcare and national security. JC was a White House Special Consultant to the Pentagon’s CIO office and co-authored DoD’s Open Technology Development roadmap. A published author, she has contributed to Wired Magazine from 1993 to the present.

Presentations

Confounding Factors Galore: Using Software Ecosystem Data to Risk-Rate Code Session

Automating security for DevOps means continuous analysis of open source software dependencies, vulnerabilities, and ecosystem dynamics. But the data are confounding: a flurry of reported vulnerabilities or infrequent commits could be good or bad signs, depending on a project's scope and life cycle. This talk illuminates nonintuitive insights from the software supply chain.

John Hitchingham is Director of Performance Engineering at FINRA, where he is responsible for driving technical innovation and efficiency across a Cloud application portfolio that processes over 75 billion market events per day to detect fraud, market manipulation, insider trading and abuse. Prior to FINRA, John worked at both large and boutique consulting firms providing technical design and consulting services to startup, media, and telecommunications clients. John has a BS in Electrical Engineering from Rutgers University.

Presentations

Cloud Data Lake – Analytic Data Warehouse in the Cloud Session

Gain insights into the design and operation of FINRA's data lake in the AWS Cloud, where FINRA extracts, transforms, and loads over 75 billion transactions per day and users can query across petabytes of data in seconds on AWS S3 using Presto and Spark, all while maintaining security and data lineage.

Passionate about cities, tech, and how they can work together to change the way we live, Vincent is the co-founder and CEO of Local Logic, an information company providing location insights on cities to help travellers, home buyers, and investors make better, more informed decisions. Prior to Local Logic, Vincent worked in real estate development, and has a background in finance and urban planning.

Presentations

Mapping cities through data to model risk in retail & real-estate FinData

The location characteristics of a retail or real estate development dictate the types of customers it attracts and the customer experience it delivers. Unfortunately, the traditional real estate and retail industries have not taken advantage of the insight that location characteristics can provide, instead relying on historic performance data to determine what and where to build.

In 2011, Felipe Hoffa moved from Chile to San Francisco to join Google as a software engineer. Since 2013 he's been a developer advocate for big data, inspiring developers around the world to leverage Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

What can we learn from 750 billion GitHub events and 42 TB of code Session

With Google BigQuery, anyone can easily analyze more than 5 years of GitHub metadata and more than 42 terabytes of open source code. We'll cover how to leverage this data to understand the community and code related to any language or project. Relevant for open source creators, users, and choosers, these are data and methods you can leverage to make better choices.
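
By way of illustration, here is a minimal sketch with the google-cloud-bigquery Python client, assuming GCP credentials are already configured (the query counts repositories per language in the public GitHub dataset):

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT lang.name AS language, COUNT(*) AS repos
        FROM `bigquery-public-data.github_repos.languages`,
             UNNEST(language) AS lang
        GROUP BY language
        ORDER BY repos DESC
        LIMIT 10
    """
    # Iterating the job waits for the query and streams back result rows.
    for row in client.query(query):
        print(row.language, row.repos)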

John Horcher brings extensive financial markets experience in trading, investment banking, and analyst roles. Mr. Horcher has also held senior-level roles at firms including SunGard, Business Intelligence Advisors, TIM Group, EDS, and Intergraph. John also served as Managing Director of Halpern Capital, where he drove the investor base for research sales and investment banking opportunities, including raising over $300 million in equity and debt.

Presentations

Discover Insights in Financial Data with Immersive Reality Session

Pushing the envelope, Immersive Reality enables powerful new information design concepts. Most importantly, the new technology enables the telling of powerful stories using more insightful thinking. Our deployments in the financial markets have enabled quicker time to insight and therefore better decision making.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and currently spends a lot of time writing a book, Stream Processing with Apache Flink.

Presentations

Stream Analytics with SQL on Apache Flink Session

Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage.

Alysa Hutnik delivers comprehensive expertise in all areas of privacy, data security, and advertising law. Her experience ranges from counseling to defending clients in FTC and state attorneys general investigations, consumer class actions, and commercial disputes. Much of Ms. Hutnik's practice is focused on the digital and mobile space in particular, including cloud, mobile payment, calling/texting practices, and big data-related services.

Ranked as a leading practitioner in the Privacy & Data Security area by Chambers USA, Chambers Global and Law360, Ms. Hutnik has received accolades for the dedicated and responsive service she provides to clients. The US Legal 500 notes that she provides “excellent, fast, efficient advice” regarding data privacy matters. In 2013, Ms. Hutnik was one of just three attorneys under 40 practicing in the area of privacy and consumer protection law to be recognized as a “Rising Star” by Law360.

Presentations

Executive Briefing: Legal Best Practices for Making Data Work Session

Big Data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data and avoiding data disasters. Alysa Hutnik will provide legal best practices and practical tips to avoid becoming a Big Data “don’t.”

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his main research is in the area of database systems, with special interest in big data, data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of the ACM Transactions of Database Systems (TODS). He is also the recipient of Thomson Reuters Research Chair in Data Quality at the University of Waterloo. He received his PhD in computer science from Purdue University, West Lafayette.

Presentations

Solving Data Cleaning and Unification Using Human Guided Machine Learning Session

Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab will provide insight into various techniques and discuss how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.

Pramod Immaneni is a PMC member of Apache Apex and lead architect at DataTorrent Inc., where he works on the Apex platform and specializes in big data applications. Prior to DataTorrent, he founded technology startups. He was CTO of Leaf Networks, a company he co-founded that was later acquired by Netgear Inc. He built products in the core networking space and holds patents in peer-to-peer VPNs. Before that, he helped start a company where he architected a dynamic content customization engine for mobile devices.

Presentations

Building a scalable streaming ingestion application with exactly once semantics using Apache Apex Session

Apache Apex is an open source stream processing platform that runs on Hadoop. Common uses of Apex include big data ingestion, streaming analytics, ETL, fast batch, real-time actions, and threat detection. This talk will cover building an ingestion application with some lightweight ETL that is scalable and fault tolerant and has exactly-once semantics.

Marta is an expert in product and price testing. She created the Strategic Pricing department at Mondelēz (a Fortune 100 CPG company), which was responsible for new product introductions and pricing across 165 countries. Prior to Mondelēz, Marta worked in management consulting at Bain & Company and was getting her MBA at Stanford's Graduate School of Business before dropping out to work full-time on Claire.

Presentations

Retail's Panacea: How Machine Learning is Driving Product Development Session

Finally, retailers are leveraging data and its insights across all parts of the supply chain to drive a profound transformation in the retail industry. From crafting assortments based on consumer demand signals, to re-imagining how consumers interact with those products, to receiving and processing customer feedback, modern-day retailers have re-imagined the product development process.

Nandu Jayakumar is a software architect and engineering leader at Visa, where he is currently responsible for the long-term architecture of data systems and leads the data platform development organization. Previously, as a senior leader of Yahoo’s well-regarded data team, Nandu built key pieces of Yahoo’s data processing tools and platforms over several iterations, which were used to improve user engagement on Yahoo websites and mobile apps. He also designed large-scale advertising systems and contributed code to Shark (SQL on Spark) during his time there. Nandu holds a bachelor’s degree in electronics engineering from Bangalore University and a master’s degree in computer science from Stanford University, where he focused on databases and distributed systems.

Presentations

Optimizing the Data Warehouse at Visa Session

At Visa, the process of optimizing the enterprise data warehouse and consolidating data marts by migrating these analytic workloads to Hadoop has played a key role in the adoption of the platform and how data has transformed Visa as an organization. This talk will look at Visa’s experience with this process and provide some best practices for organizations migrating workloads to Hadoop.

Dave Kale is a PhD student in computer science, Skymind engineer, and an Alfred E. Mann Innovation in Engineering fellow at the University of Southern California. His research uses machine learning to extract insight from digital data in high-impact domains, including, but not limited to, healthcare. His primary interest is in developing robust methods for learning meaningful representations of multivariate time series, especially using deep learning and related techniques. Dave is advised by Greg Ver Steeg of the USC Information Sciences Institute. He holds a BS in symbolic systems and an MS in computer science from Stanford University. Dave helps organize the annual Meaningful Use of Complex Medical Data (MUCMD) Symposium and is a cofounder of Podimetrics.

Presentations

Securely Building Deep Learning Models for Digital Health Data Tutorial

In this hands-on tutorial, we will teach attendees how to interactively develop and train deep neural networks to analyze digital health data using the Cloudera Workbench and DeepLearning4J (DL4J). Attendees will learn how to use the Workbench to rapidly explore real-world clinical data, build data preparation pipelines, and launch training of neural networks.

Supun Kamburugamuve is a computer science PhD candidate at Indiana University. His research focuses on big data applications and frameworks, especially data streaming for real-time analytics. He is an Apache Software Foundation member and has contributed to many open source projects, including Apache Web Services projects. For his PhD, Supun is focusing on large-scale machine learning algorithms, data streaming algorithms for robots in the cloud, and large-scale data visualizations. Recently, he has been working on high-performance enhancements to big data systems with HPC interconnects such as InfiniBand and Omni-Path. Before joining Indiana University, Supun worked on middleware systems and was a key member of the team developing an open source enterprise service bus that is widely used for enterprise integrations.

Presentations

Low Latency Streaming: Twitter Heron on Infiniband Session

Modern applications are data driven and want to move at the speed of light. To achieve real-time performance, financial applications use streaming infrastructures for low latency and high throughput. Twitter Heron is an open source streaming engine with latencies around 14 ms. In this talk, we present how we ported Heron to InfiniBand to achieve latencies as low as 7 ms.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Interactive Data Exploration and Analysis at Enterprise Scale Session

In this talk, we present best practices for building and deploying Hadoop applications to support large scale data exploration and analysis across an organization.

Daniel is a PhD student in the Stanford InfoLab supervised by Peter Bailis and Matei Zaharia. His research interests lie broadly in the intersection of machine learning and systems. Currently, he is working on deep learning applied to video analysis.

Presentations

NoScope: Querying Videos 1000x faster with Deep Learning HDS

Video is one of the fastest-growing sources of data, rich in semantic information, and advances in deep learning have made it possible to query this information with near-human accuracy. However, inference remains prohibitively expensive: the most powerful GPU cannot run state-of-the-art models in real time. In response, we present NoScope, which runs queries over video 1000x faster.
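
This is not NoScope's code, but the cascade idea behind such speedups can be sketched generically: cheap filters label the easy frames, and the expensive reference CNN is consulted only when they are unsure (frames are assumed to be numpy arrays; small_model and full_cnn are hypothetical stand-ins):

    def label_frames(frames, small_model, full_cnn,
                     diff_thresh=0.05, lo=0.2, hi=0.8):
        last, last_label = None, False
        for frame in frames:
            # 1. Difference detector: reuse the label for near-identical frames.
            if last is not None and abs(frame - last).mean() < diff_thresh:
                yield last_label
                continue
            last = frame
            # 2. Cheap specialized model: accept confident scores outright.
            score = small_model(frame)
            if score <= lo or score >= hi:
                last_label = score >= hi
            else:
                # 3. Fall back to the expensive reference CNN when unsure.
                last_label = full_cnn(frame)
            yield last_label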

Holden Karau is a transgender Canadian, an active open source contributor, and a Spark committer. She is a co-author of Learning Spark and High Performance Spark (both of which she believes are the gift of whatever season it currently is). When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.

Presentations

Extending Spark ML: Adding your own tools & algorithms Session

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines, and then looks at how to extend them with your own custom algorithms. Even if you don't have your own algorithm to add, this talk will give you a deeper understanding of Spark's ML pipelines.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install and click fraud detection, assessing the efficacy of TV campaigns, and optimization of marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His prior research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Anomaly Detection on Live Data Session

Services such as YouTube, Netflix, and Spotify popularized streaming in different industry segments. That said, such services do not center on live data, as exemplified by sensor data, which is going to be the future at large. To this end, we walk the audience through how Satori can be leveraged to collect, discover, and react to live data feeds at ultra-low latencies.
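
Satori's own API is not shown here; as a generic illustration of reacting to a live feed, here is a rolling z-score anomaly detector in Python (the thresholds and simulated feed are invented for the sketch):

    import math
    import random
    from collections import deque

    window = deque(maxlen=100)  # recent observations

    def is_anomaly(value, threshold=3.0):
        """Flag values more than `threshold` std devs from the rolling mean."""
        anomalous = False
        if len(window) >= 10:
            mean = sum(window) / len(window)
            std = math.sqrt(
                sum((v - mean) ** 2 for v in window) / len(window)) or 1.0
            anomalous = abs(value - mean) > threshold * std
        window.append(value)
        return anomalous

    # Simulated live sensor feed with one injected spike.
    for i in range(300):
        reading = random.gauss(20.0, 1.0) + (15.0 if i == 250 else 0.0)
        if is_anomaly(reading):
            print("anomaly at tick", i, "value", round(reading, 2))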

Modern Real Time Streaming Architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This, in part, stems from the deluge of high velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through the state-of-the-art streaming systems, algorithms and deployment architectures.

Elsie Kenyon helps lead product management at Nara Logics, an AI platform company. She works with enterprise customers to define product needs and with engineers to define implementations that solve those needs, focusing on data processing and machine learning. Previously, Elsie conducted research on entrepreneurship in the creative industries at Harvard Business School, co-authoring several cases. She holds a BA in American Studies from Yale University.

Presentations

Learning from customers, keeping humans in the loop Session

Enterprises today pursue AI applications to replace logic-based "expert systems," employing AI techniques to learn from customer and operational signals. But training data can be limited or nonexistent, and applying or extrapolating the wrong dataset can have business and reputational costs. We'll review how to harness institutional human knowledge to augment data in deployed AI solutions.

Kimoon Kim joined Pepperdata in 2013. Previously, he worked on the Google Search and Yahoo Search teams for many years. Kimoon has hands-on experience with large distributed systems processing massive datasets.

Presentations

HDFS on Kubernetes, lessons learned Session

There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. In this talk, we will demonstrate how to run HDFS inside Kubernetes to speed up Spark.

James Kirkland, Chief Architect, Internet of Things, Red Hat

James Kirkland is the advocate for Red Hat’s initiatives and solutions for the Internet of Things (IoT) and is the architect of Red Hat’s strategy for IoT deployments. This open source architecture combines data acquisition, integration, and rules activation with command and control data flows among devices, gateways, and the cloud to connect customers’ operational technology environments with information technology infrastructure and provide agile IoT integration.

James serves as the head subject matter expert and global team leader of system architects responsible for accelerating IoT implementations for customers worldwide. Through his collaboration with customers, partners, and systems integrators, Red Hat has grown its IoT ecosystem, expanding its presence in industries including transportation, logistics, and retail, and accelerating adoption of IoT in large enterprises.

James is a steering committee member of the IoT working group at Eclipse.org, a member of the IIC, and a frequent public speaker and author on a wide range of technical topics. His extensive knowledge of UNIX and Linux variants spans 20 years, through his positions at Red Hat and previous roles at Racemi and Hewlett-Packard.

Presentations

An Open Source Architecture for IoT Session

Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. In this session we'll showcase an end-to-end architecture for IoT based on open source standards highlighting Eclipse Kura, which is an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.
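
As a small taste of the device-to-cloud data flow such an architecture carries, here is a sketch using Eclipse Paho's Python MQTT client (Paho is another Eclipse IoT project; the broker address and topic are hypothetical):

    import json
    import time
    import paho.mqtt.client as mqtt

    # A gateway-side publisher sending sensor telemetry upstream over MQTT,
    # the kind of flow Eclipse Kura and Kapua coordinate at larger scale.
    client = mqtt.Client("gateway-01")
    client.connect("broker.example.com", 1883)

    for reading in range(5):
        payload = json.dumps({"sensor": "temp-1", "value": 20.0 + reading})
        client.publish("factory/line1/telemetry", payload)
        time.sleep(1)

    client.disconnect()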

Olivia Klose (@oliviaklose) is a software development engineer in the Technical Evangelism & Development group at Microsoft. She focuses on all analytics services on Microsoft Azure, in particular Hadoop (HDInsight), Spark, and machine learning, and is a frequent speaker at German and international conferences such as TechEd Europe, PASS Summit, and Technical Summit. Prior to joining Microsoft, she studied computer science with mathematics at the University of Cambridge, the Technical University of Munich, and IIT Bombay, where she focused on machine learning in medical imaging.

Presentations

Deploying Deep Learning to Assist the Digital Pathologist Session

KNIME, Apache Spark, and Microsoft Azure work together to enable fast, cheap, and automated classification of malignant lymphoma types in digital pathology images. The trained model is deployed to end users as a web application using the KNIME WebPortal.

An experienced IT leader with a demonstrated history of working in the banking industry, skilled in IT strategy, enterprise architecture, information management, data governance, analytics platforms, business intelligence, integration, data architecture, data warehousing, and data platforms.

A member of the bank's technology leadership team, they are responsible for enterprise information strategy delivery and for leading the enterprise data and information management services teams, covering data management, database administration (DBA), data modelling, data architecture, and ETL integration across data platforms, including traditional operational DBMSs and analytics data services. Since January 2017, they have also served on the Data Research Advisory Board of the MIT Sloan Center for Information Systems Research.

Presentations

Big Data and The Cloud Down Under - Exec Panel Session

Senior executives from major companies in Australia and New Zealand, including Air New Zealand, Westpac, ANZ, and BNZ, that have pioneered the adoption of big data technologies like Hadoop will share use cases, challenges, and how to be successful Down Under, on the opposite side of the world from where technologies like Hadoop got started.

Chi-Yi Kuan is director of business analytics at LinkedIn. He has over 15 years of extensive experience in applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Presentations

The EOI framework for Big Data Analytics to Drive Business Impact at Scale Session

We share our experiences and lessons from big data analytics at LinkedIn, presenting the EOI framework for big data analytics and explaining how to leverage this framework to drive and grow business in key corporate functions such as product, marketing, and sales.

Sanjeev Kulkarni is the co-founder of Streamlio, focused on building the next-generation real-time stack. Before Streamlio, he was the technical lead for real-time analytics at Twitter, where he co-created Twitter Heron. Before that, he was at Locomatix, where he handled their engineering stack, and earlier he worked on the AdSense team at Google, leading several initiatives. He has an MS in computer science from the University of Wisconsin-Madison.

Presentations

Modern Real Time Streaming Architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This, in part, stems from the deluge of high velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through the state-of-the-art streaming systems, algorithms and deployment architectures.

Scott Kurth is the vice president of advisory services at Silicon Valley Data Science, where he helps clients define and execute the strategies and data architectures that enable differentiated business growth. Building on 20 years of experience making emerging technologies relevant to enterprises, he has advised clients on the impact of technological change, typically working with CIOs, CTOs, and heads of business. Scott has helped clients drive global technology strategy, prioritize technology investments, shape alliance strategy based on technology, and build solutions for their businesses. Previously, Scott was director of the Data Insights R&D practice within Accenture Technology Labs, where he led a team focused on employing emerging technologies to discover the insight contained in data and bring that insight to bear on business processes, enabling new and better outcomes and even entirely new business models. He also led the creation of Accenture's annual analysis of emerging technology trends, Technology Vision, tracking emerging technologies, analyzing their transformational potential, and using that analysis to influence technology strategy for both Accenture and its clients.

Presentations

Developing a Modern Enterprise Data Strategy Session

Fundamentally, data should serve the strategic imperatives of a business—those key aspirations that will define an organization’s future vision. Conventional data strategy has little to guide us, focusing more on governance than on creating new value. In this tutorial, we explain how to create a modern data strategy that powers data-driven business.

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. Jared is the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, as well as an adjunct professor of statistics at Columbia University, in addition to his client-facing consulting and training. Jared specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world, and he was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com, and he was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master's degree in statistics from Columbia University and a bachelor's degree in mathematics from Muhlenberg College.

Presentations

Machine Learning in R Tutorial

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today's incredible computing power. This course focuses on the available methods for implementing machine learning algorithms in R and examines some of the underlying theory behind the curtain, covering the elastic net, boosted trees, and cross-validation.
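
The tutorial itself is taught in R; as a rough conceptual parallel only, here is a sketch of one of the named techniques, an elastic net tuned by cross-validation, expressed in Python's scikit-learn on synthetic data.

```python
# Conceptual parallel (the tutorial uses R): elastic net with
# cross-validated regularization on a toy regression problem.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# ElasticNetCV searches over the L1/L2 mixing ratio and penalty strength
# using k-fold cross-validation.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print(model.l1_ratio_, model.alpha_)          # selected hyperparameters
print(cross_val_score(model, X, y, cv=5).mean())  # held-out performance
```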

Francesca Lazzeri is a Data Scientist II at Microsoft, where she is part of the Algorithms and Data Science team. She is passionate about innovations in big data technologies and applications of advanced analytics to real-world problems. Her work focuses on the deployment of machine learning algorithms and web service based solutions to solve real business problems for customers in the energy, retail, and HR analytics sectors.

Presentations

Putting Data to Work: How to Optimize Workforce Staffing to Improve Organization Profitability Session

New machine learning technologies allow companies to apply better staffing strategies by taking advantage of historical data. Assigning the right people to new projects is critical for the success of each project and the overall profitability of an organization. We developed a workforce placement recommendation solution that recommends staff with the best professional profile for new projects.

Julien Le Dem is a co-creator of Apache Parquet and the PMC chair of the project. He is also a PMC member of Apache Arrow. Julien is an architect at Dremio and was previously the tech lead for Twitter's data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

The columnar roadmap: Apache Parquet and Apache Arrow Session

The Hadoop ecosystem has standardized on columnar formats, Apache Parquet on disk and Apache Arrow in memory. Vertical integration from storage to execution improves data access latency by pushing projections and filters to the storage layer. Standards make this more valuable as cross-language programming becomes as fast as native performance without costly translation.
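
As a small illustration of the pushdown idea (not code from the session), recent pyarrow releases let a reader request only certain columns and skip non-matching row groups via filters; the file and column names below are invented.

```python
# Sketch of projection and filter pushdown with pyarrow (assumes a recent
# pyarrow release where read_table accepts `filters`).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3],
                  "country": ["US", "FR", "US"],
                  "clicks": [10, 5, 7]})
pq.write_table(table, "events.parquet")

# Only the requested columns are read, and row groups that cannot match the
# filter are skipped, so less data crosses the storage boundary.
result = pq.read_table(
    "events.parquet",
    columns=["user_id", "clicks"],
    filters=[("country", "=", "US")],
)
print(result.to_pydict())
```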

Toni LeTempt
Senior Technical Expert – Wal-Mart Stores, Inc
18 years of IT experience, including 5 years on large, secure enterprise Hadoop clusters.
Distributions supported: Cloudera, Pivotal, Hortonworks.

Presentations

An Authenticated Journey Through Big Data Security at Wal-Mart Session

In today’s world of data breaches and hackers, security is one of the most important components of big data systems, but unfortunately it is usually the area least planned and architected. We will walk through one large company’s journey with authentication and give examples of how decisions made early can have a significant impact throughout the maturation of your big data environment.

Evan Levy is an acknowledged speaker, writer, and consultant in the areas of enterprise data strategy and data management. In his current role, Evan advises clients on strategies to address business challenges using data, technology, and creative approaches that align IT with business capabilities. Business is experiencing exponential growth in data volumes, sources, and systems; Evan offers practical, real-world experience in addressing these challenges in a manner that utilizes a company's existing skills, coupled with new methods, to ensure IT and business success.

Evan is a faculty member of TDWI and is a Best Practices judge in areas of Business Intelligence, Data Integration, and Data Management.

Evan is co-author of the first book on MDM: Customer Data Integration: Reaching a Single Version of the Truth, which describes the business breakthroughs achieved with integrated customer data, and explains how to make Master Data Management successful.

Presentations

The five components of a data strategy Session

This presentation identifies the five essential components that make up a data strategy and explores the individual attributes of each component. This exploratory approach allows the audience to learn the fundamentals of data strategy while understanding how the details may differ across organizations of varying complexity (a corporate division, a business unit, or an enterprise).

Michael Li is head of analytics at LinkedIn. He is a big data evangelist and practitioner who works on defining what big data is, what it means for business, and how it can drive business value through the EOI (Enable/Optimize/Innovate) analytics framework. He is passionate about solving complicated business problems with a combination of superb analytical skills and sharp business instincts, through innovation in big data applications. His specialties include hands-on leadership of analytics groups and quickly building high-performing teams to meet the needs of fast-paced, growing companies. Drawing on years of experience in big data innovation, business analytics, and business intelligence, he has worked in predictive analytics, fraud detection, operations, and statistical modeling across the financial, ecommerce, and social network industries.

Presentations

The EOI framework for Big Data Analytics to Drive Business Impact at Scale Session

We share our experiences and lessons from big data analytics at LinkedIn, presenting the EOI framework for big data analytics and explaining how to leverage this framework to drive and grow business in key corporate functions such as product, marketing, and sales.

Zhichao Li is a senior software engineer at Intel focused on distributed machine learning, especially large-scale analytical applications and infrastructure on Spark. He is also an active contributor to Spark. Before joining Intel, Zhichao worked in Morgan Stanley's FX department.

Presentations

Building Advanced Analytics and Deep Learning on Apache Spark with BigDL Session

We share our experience building end-to-end analytics and deep learning applications on top of BigDL and Spark, including speech recognition, object detection, and more. We also introduce recent developments in BigDL, including Python APIs, notebook and TensorBoard support, TensorFlow model read/write support, better recurrent and recursive net support, and 3D image convolutions.

Julia Lintern currently works at Metis as a senior data scientist, where she co-teaches the data science bootcamp, develops curricula, and focuses on various other special projects. Prior to Metis, she worked as a data scientist at JetBlue, where she used quantitative analysis and machine learning methods to provide continuous assessment of the aircraft fleet. Julia began her career as a structures engineer, designing repairs for damaged aircraft. In 2011, she transferred into a quantitative role at JetBlue and began her MA in applied math at Hunter College, where she focused on visualizations of various numerical methods, including collocation and finite element methods. She discovered a deep appreciation for the combination of mathematics and visualizations and found data science to be a natural extension. She continues to collaborate on various projects, including the current development of a trap music generator. During certain seasons of her career, she has also worked on creative side projects such as Lia Lintern, her own fashion label.

Presentations

Take a Deep-Learning Dive via Keras Tutorial

Beginning with basic neural nets and then winding our way through to convolutional neural nets and recurrent neural nets, I will explain both the design theory and the Keras implementation of today's most widely used deep-learning algorithms. As a class, we will work through these deep-learning architectures as well as the corresponding Keras code.
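
For a flavor of what the class builds, here is a minimal convolutional net in the Keras Sequential style; the layer sizes and input shape are illustrative rather than the tutorial's actual architecture.

```python
# Minimal Keras CNN sketch for 28x28 grayscale images (illustrative sizes).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # 10-class output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```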

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

A Brave New World in Mutable Big Data: Relational Storage Session

To date, mutable big data storage has primarily been the domain of non-relational (NoSQL) systems such as Apache HBase. However, demand for real-time analytic architectures has led big data back to a familiar friend: relationally structured data storage systems. This session explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu.

Ryan Lippert (@lippertryan) is a senior product marketing manager at Cloudera, responsible for the company's operational database offering as well as marketing its storage products. Prior to Cloudera, he held a variety of roles at Cisco Systems. He holds an economics degree from the University of Guelph and an MBA from Stanford.

Presentations

The Sunset of Lambda: New Architectures Amplify IoT Impact Session

A long time ago in a datacenter far, far away, we deployed complex lambda architectures as the backbone of our IoT solutions. Though hard to operate, they enabled collection of real-time sensor data and slightly delayed analytics. Today, the architecture for IoT data has been simplified by Apache Kudu, a relational storage layer for fast analytics on fast data, which is the key to unlocking the value in IoT data.

Julie Lockner is cofounder of 17 Minds Corporation, a startup focusing on improving care and education plans for children with special needs. She has held executive roles at InterSystems, Informatica, and EMC and was an analyst at ESG. She was founder and CEO of CentricInfo, a data management consulting firm. Julie holds an MBA from MIT and a BSEE from WPI.

Presentations

Predicting Tantrums with Wearable Data and Realtime Analytics Session

How can we empower individuals with special needs to reach their full potential? Julie Lockner offers an overview of a project to develop collaboration applications that use wearable device data to improve the ability to develop the best possible care and education plans. Join in to learn how real-time IoT data analytics are making this possible.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Hardcore Data Science welcome HDS

Hardcore Data Science hosts Ben Lorica and Assaf Araki welcome you to the day-long tutorial.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Hong Lu is a Data Scientist II at Microsoft. Hong is passionate about innovations in big data technologies and the application of advanced analytics to real-world problems. During her time at Microsoft, Hong has built end-to-end data science solutions for customers in the energy, retail, and education sectors. Before joining Microsoft, she worked on optimizing advertising platforms in the video advertising industry. Hong holds a PhD in biomedical engineering from Case Western Reserve University, with a research focus on machine learning-based medical image analysis.

Presentations

Putting Data to Work: How to Optimize Workforce Staffing to Improve Organization Profitability Session

New machine learning technologies allow companies to apply better staffing strategies by taking advantage of historical data. Assigning the right people to new projects is critical for the success of each project and the overall profitability of an organization. We developed a workforce placement recommendation solution that recommends staff with the best professional profile for new projects.

Zhenxiao Luo is a senior software engineer at Uber working on Presto and Parquet. Before joining Uber, he led the development and operations of Presto at Netflix. Zhenxiao has big data experience at Facebook, Cloudera, and Vertica on Hadoop-related projects. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.

Presentations

Geospatial Big Data Analysis @ Uber Session

As Uber continues to grow, our geospatial data grows exponentially. Uber's big data systems need to grow in scalability, reliability, and performance to support business decisions, user recommendations, and experimentation on geospatial data. In this talk, we share our experience running geospatial analysis efficiently in big data systems, including Hadoop, Hive, and Presto.
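
As an illustrative sketch only (the host, table, and columns are hypothetical, not Uber's), a geospatial Presto query of this kind can be issued from Python with the presto-python-client package:

```python
# Hypothetical geospatial query against Presto using presto-python-client.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com", port=8080,
    user="analyst", catalog="hive", schema="default",
)
cur = conn.cursor()
# Presto's geospatial functions (ST_Contains, ST_Point, ST_GeometryFromText)
# let us count trips whose pickup point falls inside a city polygon.
cur.execute("""
    SELECT count(*)
    FROM trips
    WHERE ST_Contains(ST_GeometryFromText(polygon_wkt),
                      ST_Point(pickup_lng, pickup_lat))
""")
print(cur.fetchone())
```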

Thiru was previously a distinguished architect at Yahoo and held the title of "Principal Hacker" at Altiscale. He was the architect at Stata Labs, where he built the desktop search engine Bloomba. He is a committer and PMC member of the Apache Avro project. He has held a number of technical and managerial engineering roles at Accel Limited and Hewlett-Packard. Thiru holds a BE in electronics and communications engineering from Anna University.

Presentations

SETL: An efficient and predictable way to do Spark ETL Session

Common ETL jobs used for importing log data into Hadoop clusters require a considerable amount of resources, which varies based on input size. Here we present a set of techniques that not only make these jobs much more efficient but also let them work well with a fixed amount of resources. It involves innovative use of Spark processing and exploits features of Hadoop file formats.
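
A generic PySpark log-import job in the spirit of the session might look like the sketch below; the paths, field names, and resource settings are assumptions for illustration, not the talk's actual techniques.

```python
# Illustrative log-import ETL: read raw logs, derive a partition column, and
# write partitioned Parquet with a fixed executor footprint.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("setl-style-import")
    # Fixed resources rather than scaling with input size.
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "8")
    .getOrCreate()
)

logs = spark.read.json("hdfs:///raw/logs/2017-09-26/*.json.gz")
cleaned = logs.withColumn("dt", F.to_date("timestamp"))

# Partitioned columnar output keeps downstream queries efficient.
cleaned.write.partitionBy("dt").parquet("hdfs:///warehouse/logs/", mode="append")
```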

Ted Malaska is a senior solution architect at Blizzard. Previously, he was a principal solutions architect at Cloudera. Ted has 18 years of professional experience working for startups, the US government, some of the world’s largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the largest nonprofit financial regulator in the US and has worked on close to one hundred clusters for over two dozen clients with hundreds of use cases. He has architecture experience across topics including Hadoop, Web 2.0, mobile, SOA (ESB, BPM), and big data. Ted is a regular contributor to the Hadoop, HBase, and Spark projects, a regular committer to Flume, Avro, Pig, and YARN, and the coauthor of O’Reilly Media’s Hadoop Application Architectures.

Presentations

Architecting a next generation data platform Tutorial

Using the Internet of Things and Customer 360 as an example, we’ll explain how to architect a modern, real-time big data platform leveraging recent advancements in open-source software. We’ll show how components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL along with Apache Hadoop can enable new forms of data processing and analytics.

Bruce Martin is a senior instructor at Cloudera, where he teaches courses on data science, Apache Spark, Apache Hadoop, and data analysis. Previously, Bruce was principal architect and director of advanced concepts at SunGard Higher Education, where he developed the software architecture for SunGard’s Course Signals Early Intervention System, which uses machine-learning algorithms to predict the success of students enrolled in university courses. Bruce’s other roles have included senior staff engineer at Sun Microsystems and researcher at Hewlett-Packard Laboratories. Bruce has written many papers on data management and distributed system technologies and frequently presents his work at academic and industrial conferences. Bruce holds patents on distributed object technologies. Bruce holds a PhD and master’s degree in computer science from the University of California at San Diego and a bachelor’s degree in computer science from the University of California, Berkeley.

Presentations

Cloudera Big Data Architecture Workshop 2-Day Training

The Cloudera Big Data Architecture Workshop (BDAW) is a 2-day learning event that addresses advanced big data architecture topics. BDAW brings together technical contributors into a group setting to design and architect solutions to a challenging business problem. The workshop addresses big data architecture problems in general, and then applies them to the design of a challenging system.

Cloudera Big Data Architecture Workshop (Day 2) Training Day 2

The Cloudera Big Data Architecture Workshop (BDAW) is a 2-day learning event that addresses advanced big data architecture topics. BDAW brings together technical contributors into a group setting to design and architect solutions to a challenging business problem. The workshop addresses big data architecture problems in general, and then applies them to the design of a challenging system.

Hilary Mason is founder and CEO of Fast Forward Labs, a machine intelligence research company, and data scientist in residence at Accel Partners. Previously Hilary was chief scientist at Bitly. She cohosts DataGotham, a conference for New York’s homegrown data community, and cofounded HackNY, a nonprofit that helps engineering students find opportunities in New York’s creative technical economy. Hilary served on Mayor Bloomberg’s Technology Advisory Board and is a member of Brooklyn hacker collective NYC Resistor.

Presentations

Executive Briefing: Talking to Machines: Natural Language Today Session

Progress in machine learning has allowed us to imagine that we might soon be able to build machines that talk to us using the same interface that we use to talk to each other: natural language. In this talk, we'll explore how close we are to that ideal and walk through the current state of natural language technologies.

Tony McAllister, National Marrow Donor Program/Be the Match (https://www.linkedin.com/in/mcallistertony/)

Presentations

Implementing Hadoop to Save Lives Session

At the National Marrow Donor Program (Be the Match), we have moved our core transplant matching platform onto Cloudera Hadoop. We will discuss why we chose to migrate our platform to Cloudera Hadoop and cover many of our big data goals. Our ultimate goals are to increase the number of donors and matches, make the process more efficient, and make transplants more effective.

Michael McCune is a software developer in Red Hat’s emerging technology group. He is an active contributor to several radanalytics.io projects, as well as a core reviewer for the OpenStack API Working Group. Since joining Red Hat three years ago, he has been developing and deploying applications for cloud platforms. Prior to his career at Red Hat, Michael developed Linux-based software for embedded global positioning systems.

Presentations

From notebook to cloud native, a modern path for data driven applications Session

Notebook interfaces like Apache Zeppelin and Project Jupyter are excellent starting points for sketching out ideas and exploring data driven algorithms, but where does the process lead after the notebook work has been completed? In this session we will explore the answers as they relate to cloud native platforms.

Matteo Merli is a software engineer at Streamlio working on messaging and storage technologies. Prior to Streamlio, he spent several years at Yahoo building database replication systems and multi-tenant messaging platforms. He was the architect and lead developer for Pulsar and a member of the PMC of Apache BookKeeper.

Presentations

Messaging, Storage or Both: The Real Time Story of Pulsar and Apache DistributedLog Session

Modern enterprises produce data not only at volume but also at high velocity. To keep up with that velocity and process data in real time, new types of storage systems have been designed, implemented, and deployed. This talk focuses on Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.
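
For orientation, here is a minimal produce/consume sketch against Pulsar using the pulsar-client Python package; the service URL and topic are placeholders, and durability comes from the BookKeeper layer the talk describes.

```python
# Minimal Pulsar produce/consume sketch (placeholder URL and topic).
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("persistent://public/default/events")
producer.send(b"order-created:1234")  # durably written via BookKeeper

consumer = client.subscribe("persistent://public/default/events", "analytics-sub")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)  # advance the subscription cursor past this message

client.close()
```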

Chris has been coding since grade school. Unable to choose between science and engineering, he has worked on projects in genetics, natural language processing, distance learning, content syndication, automated categorization, and recommender systems. He loves games and puzzles of all sorts, and thinks that the intersection of Big Data and human behavior offers some of the very best puzzles available.

He currently runs the big data team for if(we).

Presentations

Lessons from an AWS Migration Session

Our batch event processing pipeline is different from yours, but the process of migrating it from running in our data center to running in AWS is likely to be pretty similar. This talk will cover what was easier than expected, what was harder, and what we wished we had known before we started the migration.

Harjinder Mistry is currently a member of the developer tools team at Red Hat, where he is incorporating data science into next-generation developer tools powered by Spark. Prior to Red Hat, he was a member of the IBM Analytics team, where he developed Spark ML pipeline components of the IBM Analytics Platform. Earlier, he spent several years on the DB2 SQL query optimizer team, building and fixing the mathematical model that decides the query execution plan. He holds an MTech degree from IIIT Bangalore, India.

Presentations

AI-driven Next Generation Developer Tools Session

This talk covers how machine learning and deep learning techniques are helping Red Hat build smart developer tools that make software developers more efficient.

Robin Moffatt is a Partner Technology Evangelist for Confluent. He loves working with Apache Kafka. Previously he worked on large RDBMS data warehousing projects and consultancy. His particular interests are data and analytics, systems architecture, performance testing and optimisation. He blogs at confluent.io and rmoff.net and can be found tweeting grumpy geek thoughts as @rmoff.

Presentations

One cluster does not fit all: Architecture patterns for multicluster Apache Kafka deployments Session

There are many good reasons to run more than one Kafka cluster…and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Robin Moffatt offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions to help you better choose the right architecture for your needs.
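
As a sketch of one such pattern only (addresses and topic are hypothetical, and this is not from Robin's talk materials): producers typically write to their local cluster, while an aggregate cluster is fed by a replication tool such as MirrorMaker rather than by dual writes, which risk divergence if one write fails.

```python
# kafka-python sketch of the "produce locally, replicate centrally" pattern.
import json
from kafka import KafkaProducer

# Producers point only at their local cluster; cross-cluster copying is left
# to a dedicated replication tool (e.g., MirrorMaker).
local = KafkaProducer(
    bootstrap_servers="kafka-local:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
local.send("payments", {"id": 42, "amount_usd": 9.99})
local.flush()  # block until the write is acknowledged locally
```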

Most of what I know I have taught myself, with much thanks to the online communities I was active in from 2003-2007 when the internet was much smaller. I have been programming and making games since I was 12, creating short films and composing music since I was 14, and trading stocks since I was 18. At 21, I created Tech Trader, a fully autonomous system that traded and understood the markets the same way I did, and at 24, I created a fully autonomous hedge fund run and managed by that system.

Presentations

Findata session with William Mok FinData

William Mok, The Tech Trader Fund

Karen Moon is co-founder and CEO of Trendalytics, a product intelligence engine that measures consumer engagement with merchandise trends. With over twelve years of experience in retail and technology, Karen has worked with companies across the supply chain: department stores, luxury retailers, and independent designers. At Goode Partners, she executed the firm’s investment in Skullcandy (NASDAQ: SKUL) and worked on the turnaround of a luxury specialty retailer. Previously she worked in Gap Inc.’s corporate strategy group, where she assessed acquisition and new retail concept opportunities such as Piperlime.com. Karen started her career in investment banking at Goldman Sachs & Co., where she executed over $1 billion in technology and media transactions.

Karen holds an M.B.A. from Harvard Business School and a B.A. from UCLA where she graduated Summa Cum Laude. Her research at Harvard included studies in multi-channel retailing, luxury diffusion brands and supply chain innovation for emerging designers.

She has been featured in the Wall Street Journal, WWD, Sourcing Journal, Forbes, and other publications.

Speaking Engagements: Strata NY 2014 keynote, NRF, Shoptalk, Luxury Forward, FortuneTech Brainstorm, Fashion Digital, Harvard Business School Retail & Luxury Goods Conference, Decoded Fashion, IFB Conference

Presentations

Retail's Panacea: How Machine Learning is Driving Product Development Session

Finally, retailers are leveraging data and its insights across all parts of the supply chain to drive a profound transformation in the retail industry. From crafting assortments based on consumer demand signals, to re-imagining consumer interactions with such products, to receiving and processing customer feedback, modern day retailers have re-imagined the product development process.

Neha Narkhede is the cofounder and CTO at Confluent, a company backing the popular Apache Kafka streaming platform. Prior to founding Confluent, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s petabyte-scale streaming infrastructure built on top of Apache Kafka and Apache Samza. Neha specializes in building and scaling large distributed systems and is one of the initial authors of Apache Kafka. A distributed systems engineer by training, Neha works with data scientists, analysts, and business professionals to move the needle on results.

Presentations

The three realities of modern programming: cloud, microservices, and the explosion of data Session

Learn how the three realities of modern programming (the explosion of data and data systems, the building of business processes as microservices instead of monolithic applications, and the rise of the public cloud) affect how developers and companies operate today, and why companies across all industries are turning to streaming data and Apache Kafka for mission-critical applications.

Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, cloud computing, and peer teaching, with 30+ years of tech-industry experience ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the Top 30 People in Big Data and Analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

PyTextRank: graph algorithms for enhanced natural language processing Session

PyTextRank is an open source Python implementation of TextRank, a graph algorithm for NLP based on the Mihalcea 2004 paper. It builds atop spaCy, datasketch, NetworkX, and other popular libraries to prepare raw text for AI applications in media and learning. Level up beyond outdated techniques such as stemming, n-grams, and bag-of-words while performing advanced NLP in single-server solutions.
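
A usage sketch follows, based on recent PyTextRank releases that register as a spaCy pipeline component; the API has evolved across versions, so treat the exact calls as an assumption about a current release.

```python
# PyTextRank usage sketch (assumes spaCy 3.x and a recent pytextrank release).
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" pipeline factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp("PyTextRank builds a lemma graph over the text and ranks phrases "
          "with a PageRank-style algorithm, going beyond bag-of-words.")

# Top-ranked phrases, each with a TextRank score.
for phrase in doc._.phrases[:5]:
    print(phrase.rank, phrase.text)
```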

Chris Neumann is a big data veteran who has been at the forefront of data innovation for more than 10 years. He was the founder and CEO of DataHero (recently acquired by Cloudability), which brought to market the first self-service cloud BI platform. Previously, he was the first employee at Aster Data Systems (acquired by Teradata), where he helped create the big data space.

Presentations

Accelerating the Next Generation of Data Companies Session

This panel brings together partners from some of the world’s leading startup accelerators with founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Building your first big data application on AWS Tutorial

Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on the cloud? Join us for this hands-on workshop as we build a big data application using a combination of open source technologies, such as Apache Spark and Zeppelin, as well as AWS managed services, such as Amazon EMR and Amazon Kinesis. You'll pick up best practices and design patterns along the way.
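
For a first taste of the Kinesis piece of such an application, here is a hedged boto3 sketch; the stream name, region, and record contents are placeholders, not the workshop's lab setup.

```python
# Putting a record onto a Kinesis stream with boto3 (placeholder values).
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Records with the same partition key land on the same shard, preserving
# their relative order.
kinesis.put_record(
    StreamName="clickstream-demo",
    Data=json.dumps({"user": "u1", "page": "/home"}).encode("utf-8"),
    PartitionKey="u1",
)
```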

Lu is a software engineer working on Hadoop infrastructure and analytics at Uber, focusing on big data SQL engines, including Presto and Hive. Before joining Uber, Lu worked on the Data and Ads team at Yahoo, where he designed and implemented Yahoo’s new-generation user profile platform. Lu holds a master’s degree from the University of Southern California and a bachelor’s degree from Sun Yat-sen University.

Presentations

Geospatial Big Data Analysis @ Uber Session

As Uber continues to grow, our geospatial data grows exponentially. Uber's big data systems need to grow in scalability, reliability, and performance to support business decisions, user recommendations, and experimentation on geospatial data. In this talk, we share our experience running geospatial analysis efficiently in big data systems, including Hadoop, Hive, and Presto.

Brian O’Neill (@rhythmspice) has been designing useful, usable, and beautiful products for the web since 1996. Currently, Brian focuses on helping companies design indispensable analytics software products that customers love via his consultancy, Designing for Analytics. His clients and past employers include DELL/EMC, NetApp, Tripadvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JP Morgan Chase, the Future of Music Coalition, and ETrade, among others, and he has worked on award-winning storage industry software for Akorri and Infinio. Around 2010, Brian co-founded the adventure travel company TravelDragon.com, and he has invested in several Boston-area startups as well. When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage as a professional percussionist and drummer. He leads the acclaimed dual ensemble Mr. Ho’s Orchestrotica, which is “anything but straightforward” (Washington Post) and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival.

Presentations

Design for Non-Designers: Increasing Revenue, Usability, and Utility Within Data Analytics Products Session

Do you spend a lot of time explaining your data analytics product to your customers? Is your UI/UX or navigation overly complex? Are sales suffering due to complexity or, worse, are customers simply not using your product much? If so, your design may be the problem. However, here's a secret: you don't have to be a trained designer to recognize design and UX problems and start correcting them today.

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and applications solutions. A veteran of WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting A Data Platform Tutorial

What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop, Spark and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Senior Manager, Development Corporate Information Services at TD
An experienced and innovative information technology professional with strengths in architecting, managing, and delivering data solutions across multiple business lines, with over 20 years of progressive experience.
Strengths:
• People management: team management, development of internal talent, and recruiting talent.
• Implementation of big data (“Hadoop”) solutions, business intelligence, database design, and project management.
• Extensive knowledge of information systems design and integration, web information delivery, and systems automation.

Presentations

Griffin – Fast-tracking model development in Hadoop Session

Griffin is a high-level, easy-to-use framework built on top of Spark that encapsulates the complexities of common model development tasks within four phases: data understanding, feature extraction, model development, and serving modelling results.

Andrew Otto is a systems engineer at the Wikimedia Foundation, where he supports the Analytics team by architecting and maintaining small and big data analytics infrastructure. Previously he was the lead systems administrator at CouchSurfing.org. He is based in Brooklyn, NY and spends too much time playing hardcourt bike polo.

Presentations

Analytics at Wikipedia Session

The Wikimedia Foundation is a non-profit and charitable organization and the parent company of Wikipedia. As one of the most visited websites in the world, we face many unique challenges around better understanding our ecosystem of editors, readers, and content. In this session, we will discuss how we do analytics at the WMF and cover the technology we use for our data.

Shoumik Palkar is a second-year PhD student in the Infolab at Stanford University, working with Matei Zaharia on high-performance data analytics. He holds a degree in electrical engineering and computer science from UC Berkeley.

Presentations

Weld: Accelerating Data Science by 100x Session

Modern data applications combine functions from many optimized libraries (e.g., Pandas and TensorFlow), and yet do not achieve peak hardware performance due to data movement across functions. Weld is a new interface to implement functions in these libraries while enabling optimizations across them. Weld can be integrated into libraries such as Pandas or Spark SQL with no changes to user code.

Lloyd Palum is the CTO of Vnomics, where he directs the company’s technology development associated with optimizing fuel economy in commercial trucking. Lloyd has more than 25 years of experience in both commercial and government electronics, has published a number of technical articles and speaks frequently at industry conferences. He holds five patents in the field of software and wireless communications. Lloyd earned his MSEE from Boston University and BSEE from the University of Rochester.

Presentations

How to Build a Digital Twin (a tutorial) Session

If the performance of your business depends on the efficiency of machines, then you should employ digital twins. In this tutorial, we will walk through the process of building a digital twin of a tractor-trailer big rig using Python and TensorFlow. The example model will then be used to show how performance can be tracked and optimized using this technology.
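
To make the idea concrete, here is a toy sketch of the pattern (not Vnomics' actual model): train a network that maps observed operating conditions to expected fuel use, then compare its predictions against measured consumption to track efficiency. All features and data below are invented for illustration.

```python
# Toy digital-twin sketch: learn expected fuel burn from telemetry, then
# flag efficiency loss as the gap between predicted and measured values.
import numpy as np
from tensorflow.keras import layers, models

# Hypothetical telemetry: speed (mph), grade (%), load (tons) -> gallons/hour.
X = np.random.rand(1000, 3) * [70, 6, 40]
y = (2.0 + 0.05 * X[:, 0] + 0.3 * X[:, 1] + 0.04 * X[:, 2]
     + np.random.randn(1000) * 0.1)

twin = models.Sequential([
    layers.Dense(32, activation="relu", input_shape=(3,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),  # predicted fuel burn for these conditions
])
twin.compile(optimizer="adam", loss="mse")
twin.fit(X, y, epochs=10, batch_size=32, verbose=0)

# Expected fuel burn at 55 mph, 2% grade, 20-ton load; compare to actuals.
print(twin.predict(np.array([[55.0, 2.0, 20.0]])))
```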

Gene Pang is a software engineer at Alluxio. Previously, he worked at Google. Gene recently earned his PhD from the AMPLab at UC Berkeley, working on distributed database systems, and holds an MS from Stanford University and a BS from Cornell University.

Presentations

Best Practices for using Alluxio with Spark Session

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory to manage data across different storage systems. Many deployments use Alluxio with Spark because Alluxio helps Spark be more effective and further accelerates applications. We discuss how Alluxio achieves this and describe production deployments of Alluxio and Spark working together.
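
In practice the integration is largely a matter of the URI scheme, assuming the Alluxio client jar is on Spark's classpath; the master address, path, and column below are placeholders.

```python
# Reading data through Alluxio from Spark (placeholder address and path;
# assumes the Alluxio client jar is on the Spark classpath).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-demo").getOrCreate()

# Reads go through the alluxio:// scheme; hot data served from memory avoids
# repeated trips to the slower underlying storage.
df = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events/")
df.groupBy("country").count().show()
```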

Kevin Parent is the CEO of Conduce. He is an innovator whose entire career has focused on connecting the dots between advances in technology and human experiences. With Conduce, leaders and teams can see and interact with all their data instantly using a single, intuitive human interface.

Before founding Conduce, Kevin co-founded Oblong Industries, where he invented new–to–world interfaces that allow users to interact with software using displays, gestures, wands, tablets and smart phones.
Prior to Oblong, Kevin spent ten years engineering theme park attractions. At Walt Disney Imagineering he was a Project Engineer for the Twilight Zone Tower of Terror.
Kevin holds six patents and a degree in physics from the Massachusetts Institute of Technology. His undergraduate thesis work was conducted in MIT’s Media Lab.

Presentations

Seeing everything so managers can act on anything: IoT in DHL Supply Chain operations Data 101

DHL has created an IoT initiative for its supply chain warehouse operations. Utilizing immersive operational data visualization, DHL is gaining unprecedented insight, from the most comprehensive global view across all locations down to a unique data feed from a single sensor, allowing it to see, understand, and act on everything that occurs in its warehouses.

Mo Patel, based in Austin, Texas, is a practicing data scientist at Teradata. As practice director, he is focused on building the artificial intelligence and deep learning consulting practice by mentoring and advising Teradata clients and providing guidance on ongoing deep learning projects. Mo has successfully managed and executed data science projects with clients across several industries, including a major cable company, a major auto manufacturer, a major medical device manufacturer, a leading technology firm, and a major car insurance provider. A continuous learner, he conducts research on applications of deep learning, reinforcement learning, and graph analytics toward solving existing and novel business problems.

Presentations

Training Recommendation Models Tutorial

Learn to apply deep learning to improve consumer recommendations. We train neural nets to learn categories of interest for recommendations (e.g., for cold start) using embeddings, then extend this with WALS matrix factorization to achieve wide-and-deep learning, which is now used in production for the Google Play store. Learn with TensorFlow on our cloud GPU (or bring your own GPU laptop).
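
As a compact sketch of the embedding idea (illustrative sizes and fake data, not the tutorial's lab code), user and item embeddings can be learned jointly so that their dot product scores affinity:

```python
# Learning user/item embeddings for recommendation with Keras (toy data).
import numpy as np
from tensorflow.keras import layers, Model

n_users, n_items, dim = 1000, 500, 16

user_in = layers.Input(shape=(1,))
item_in = layers.Input(shape=(1,))
u = layers.Flatten()(layers.Embedding(n_users, dim)(user_in))
v = layers.Flatten()(layers.Embedding(n_items, dim)(item_in))
score = layers.Dot(axes=1)([u, v])  # affinity = dot product of embeddings

model = Model([user_in, item_in], score)
model.compile(optimizer="adam", loss="mse")

# Fake implicit-feedback interactions for demonstration.
users = np.random.randint(0, n_users, 10000)
items = np.random.randint(0, n_items, 10000)
labels = np.random.rand(10000)
model.fit([users, items], labels, epochs=2, batch_size=256, verbose=0)
```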

Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh is a graduate of the University of Tennessee at Chattanooga with a master of computer science, where he did research in mesh networks and social insect swarm algorithms. Josh is a cofounder of the DL4J open source deep learning project and is a coauthor on the upcoming O’Reilly title Deep Learning: A Practitioner’s Approach. Josh has over 15 years’ experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif.

Presentations

Realtime Image Classification: Using Convolutional Neural Networks on Realtime Streaming Data Session

Enterprises building data lakes often have to deal with very large volumes of image data collected over the years. In this session, we will talk about how some of the most sophisticated big data deployments are using convolutional neural nets to automatically classify images and add rich context about their content, in real time, while ingesting data at scale.

Securely Building Deep Learning Models for Digital Health Data Tutorial

In this hands-on tutorial, we will teach attendees how to interactively develop and train deep neural networks to analyze digital health data using the Cloudera Workbench and DeepLearning4J (DL4J). Attendees will learn how to use the Workbench to rapidly explore real world clinical data, build data preparation pipelines, and launch training of neural networks.

Joshua Patterson is the Director of Applied Solutions Engineering at NVIDIA and a former White House Presidential Innovation Fellow. Prior to NVIDIA, Josh worked with leading experts across public sector, private sector, and academia to build a next generation cyber defense platform. His current passions are graph analytics, machine learning, and GPU data acceleration. Josh also loves storytelling with data, and creating interactive data visualizations. Josh holds a B.A. in economics from the University of North Carolina at Chapel Hill and an M.A. in economics from the University of South Carolina Moore School of Business.

Presentations

Training a Deep Learning Risk Detection Platform Session

Learn how to bootstrap your own deep learning framework to detect risk and threats in production operational systems, using best-of-breed GPU-accelerated open source tools.

Nick Pentreath is a principal engineer at IBM, working primarily on machine learning and Apache Spark. He is a member of the Apache Spark PMC and the author of Machine Learning with Spark.

Previously, he co-founded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match and Mxit.

He is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Deep Learning for Recommender Systems Session

In the last few years, deep learning has achieved significant success in a wide range of domains including computer vision, artificial intelligence, speech, NLP and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. This talk will explore recent advances in this area in both research and practice.

Frances Perry is a software engineer who likes to make big data processing easy, intuitive, and efficient. After many years working on Google’s internal data processing stack, Frances joined the Cloud Dataflow team to make this technology available to external cloud customers. She led the early work on Dataflow’s unified batch/streaming programming model and is on the PMC for Apache Beam.

Presentations

Realizing the promise of portability with Apache Beam Session

Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. Come learn the basics of Beam and see the portability in action.
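
The portability claim is easiest to see in code: the tiny Beam pipeline below (a generic word count, not the session's example) runs unchanged on the local DirectRunner or, via the --runner option, on Dataflow, Flink, or Spark.

```python
# A minimal Apache Beam pipeline in Python; the runner is chosen at launch
# time, so the same code is portable across execution engines.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["to be or not to be"])
        | beam.FlatMap(lambda line: line.split())
        | beam.combiners.Count.PerElement()
        | beam.Map(print)
    )
```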

With a background in high performance computing, data warehousing, and distributed systems, Mike Pittaro, Distinguished Engineer on Dell EMC’s Open Source Solutions team, specializes in designing and developing big data solutions. He has held engineering and service positions at Alliant Computer, Kendall Square Research, Informatica, and SnapLogic.

Presentations

Considerations for hardware accelerated machine learning platforms Session

The advances we see in machine learning would be impossible without hardware improvements, but building a high performance hardware platform is tricky. It involves hardware choices, an understanding of software frameworks, algorithms, and how they interact. This session will help you learn the secrets of matching the right hardware and tools to the right algorithms for optimal performance.

I am a financial technologist specializing in front-end development, mostly for trading and analytics applications. I have worked on a wide variety of UI technologies in the past, ranging from Java Swing, Eclipse SWT, and Nokia Qt to Cocoa on OSX/iOS, .Net WPF, and HTML5. I am the author of WPF Control Development Unleashed with Addison/Wesley-SAMS. I am also the creator of QuickLens, a Mac App targeted at UI Designers and Developers.

Presentations

Expanding Data Literacy with Data Visualizations Session

While the value of data and its role in informing decisions and communications is well known, the true meaning is often lost or incorrectly interpreted without appropriate data visualizations that provide context and accurate representation of the underlying numbers. In this talk you will learn new data visualizations to use for your own analysis and presentations to others.

Adrian Popescu is a data engineer at Unravel Data Systems working on performance profiling and optimization of Spark applications. He has 8+ years of experience in building and profiling data management applications. His recent PhD thesis focused on modeling the runtime performance of a class of analytical workloads that include iterative tasks executing on in-memory graph processing engines (Giraph BSP), and SQL queries executing at scale on Hive. He holds a PhD in Computer Science from EPFL, a Masters of Applied Science from University of Toronto, and a Bachelor of Science from University Politehnica of Bucharest.

Presentations

Using ML to solve failure problems with ML and AI apps in Spark Session

A roadblock in the agility that comes with Spark is that application developers can get stuck with application failures, and have a tough time finding and resolving the issue. To address this roadblock, we have been working closely with Spark application developers to automatically identify and alleviate the root cause of application failures using ML techniques.

Presentations

Data science and e-Sports DCS

Sean Power, Repable

Greg Rahn is a director of product management, responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over 20 years, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently product management, giving him a holistic view of and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Rethinking Data Marts in the Cloud: Common Architectural Patterns for Analytics Session

Whether you’re deployed today or just starting to consider it, cloud environments will likely play a key role in your business’s future. We will discuss workload considerations when evaluating the cloud for analytics and cover the common architectural patterns used to optimize price and performance.

Karthik Ramasamy is the co-founder of Streamlio, which focuses on building next-generation real-time processing engines. Before Streamlio, he was the engineering manager and technical lead for real-time analytics at Twitter, where he co-created Twitter Heron. He has two decades of experience working in parallel databases, big data infrastructure, and networking. He cofounded Locomatix, a company specializing in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Before Locomatix, he had a brief stint with Greenplum, where he worked on parallel query scheduling; Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases, and high-availability solutions for network routers that are widely deployed on the internet. Before joining Juniper, he worked extensively at the University of Wisconsin on parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems; several of these research projects were later spun off as a company acquired by Teradata.

Karthik is the author of several publications, patents, and Network Routing: Algorithms, Protocols and Architectures. He has a Ph.D. in computer science from the University of Wisconsin, Madison with a focus on big data and databases.

Presentations

Low Latency Streaming: Twitter Heron on Infiniband Session

Modern applications are data driven and want to move at the speed of light. To achieve real-time performance, financial applications use streaming infrastructures for low latency and high throughput. Twitter Heron is an open source streaming engine with low latency, around 14 ms. In this talk, we present how we ported Heron to InfiniBand to achieve latencies as low as 7 ms.

Modern Real Time Streaming Architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This, in part, stems from the deluge of high velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through the state-of-the-art streaming systems, algorithms and deployment architectures.

Twitter Heron Goes Exactly Once Session

Twitter processes billions of events per day at the instant the data is generated. To achieve real-time performance, Twitter employs Heron, an open-source streaming engine tailored for large-scale environments. In this talk, Karthik will present the techniques used by Heron to implement exactly once and share the operating experiences at scale.

Jun Rao is the cofounder of Confluent, a company that provides a stream data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM’s Almaden research data center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer of Apache Cassandra.

Presentations

Apache Kafka Core Internals: A Deep Dive Session

In the last few years, Apache Kafka has emerged as a streaming platform and has been used extensively in enterprises for real-time data collection, delivery, and processing. This talk provides a deep dive into some of the key internals that help make Kafka popular and provide strong reliability guarantees.
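
One reliability guarantee from the talk's territory can be sketched from the client side (broker addresses and topic are placeholders): with acks="all", the broker confirms a write only after the full in-sync replica set has it.

```python
# kafka-python sketch of a producer configured for strong durability.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092,broker2:9092",
    acks="all",   # wait for all in-sync replicas before acknowledging
    retries=5,    # retry transient failures
)
future = producer.send("audit-log", b"balance-updated:acct-17")
metadata = future.get(timeout=10)  # raises if the write was not acknowledged
print(metadata.topic, metadata.partition, metadata.offset)
```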

Sneha Rao is an experienced product owner with a demonstrated history of working with big data at scale at Spotify, the New York Times, Comcast/NBCUniversal, and a NASA data center. She is a strong engineering professional skilled in database management, big data, analytics, and Python, and is currently pursuing an MBA focused on innovation, design, and entrepreneurial studies at New York University's Leonard N. Stern School of Business.

Presentations

Managing core data entities for internal customers at Spotify Session

At Spotify, we make data-driven product decisions. As we grow as a company, the magnitude and complexity of the data we care for the most are growing at a rapid pace. During this 40-minute presentation, we will walk you through how we store and expose audience data created by multiple internal producers to consumers within Spotify.

Pranav Rastogi is a Microsoft program manager on the Azure HDInsight team. Azure HDInsight is a fully managed cloud Hadoop offering. Pranav spends most of his time making it easier for customers to leverage the big data ecosystem and build big data solutions faster.

Presentations

Building your big data applications on Azure Tutorial

Big data solutions are rapidly moving to the cloud, and it is increasingly important to learn how to use Apache Hadoop, Spark, R Server, and other open source technologies in the cloud. In this tutorial, we will walk you through building big data applications on Azure HDInsight and other Azure services.

Alex Ratner is a 3rd-year PhD student advised by Chris Re at the Stanford InfoLab, where he works on new machine learning paradigms for settings where limited or no hand-labeled training data is available, motivated in particular by information extraction problems in domains like genomics, clinical diagnostics, and political science. He co-leads the development of the Snorkel framework for lightweight information extraction (snorkel.stanford.edu).

Presentations

Data Programming: Creating Large Training Sets, Quickly HDS

With ever more data-hungry algorithms becoming the norm in machine learning, getting labeled training data has become the bottleneck. In this talk, I will describe a paradigm for the programmatic creation of training sets, called data programming, in which users express weak supervision strategies or domain heuristics as simple scripts called labeling functions, which are then automatically denoised.
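
To make the idea concrete, below is a minimal sketch of what labeling functions can look like under this paradigm. The candidate object, its text_between attribute, and the heuristics themselves are illustrative assumptions, not the actual Snorkel API.

    # Illustrative labeling functions for a relation extraction task.
    # Each function votes +1 (positive), -1 (negative), or 0 (abstain);
    # data programming then denoises and combines these noisy votes.
    import re

    def lf_contains_causes(candidate):
        # Heuristic: "causes" appearing between two entities suggests a true relation.
        return 1 if "causes" in candidate.text_between else 0

    def lf_negation_nearby(candidate):
        # Heuristic: a negation word between the entities suggests a false relation.
        return -1 if re.search(r"\b(no|not|never)\b", candidate.text_between) else 0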

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed, robust cloud applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. Radhika enjoys spending time with her family, walking her dog, doing Warrior X-Fit, and playing an occasional hand at Smash Bros.

Presentations

Building your first big data application on AWS Tutorial

Want to get ramped up on how to use Amazon's big data web services and launch your first big data application in the cloud? Join us for this hands-on workshop as we build a big data application using a combination of open source technologies, such as Apache Spark and Zeppelin, as well as AWS managed services, such as Amazon EMR and Amazon Kinesis, and pick up best practices and design patterns along the way.

Dario Rivera is a solutions architect at Amazon Web Services, where he helps customers get the most out of AWS. A 20-year IT veteran, Dario has also worked widely within the public sector, holding positions within the DOD, FBI, DHS, and DEA. From highly available, scalable, and elastic architectures to complex enterprise systems with zero-downtime availability, Dario is always on the lookout for a challenge to change the world through customer success. Dario has presented at conferences and venues around the world, including re:Invent, Strata + Hadoop World, HIMSS, and Oxford University.

Presentations

Building your first big data application on AWS Tutorial

Want to get ramped up on how to use Amazon's big data web services and launch your first big data application in the cloud? Join us for this hands-on workshop as we build a big data application using a combination of open source technologies, such as Apache Spark and Zeppelin, as well as AWS managed services, such as Amazon EMR and Amazon Kinesis, and pick up best practices and design patterns along the way.

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.

Presentations

Rethinking Data Marts in the Cloud: Common Architectural Patterns for Analytics Session

Whether you're deployed in the cloud today or just starting to consider it, cloud environments will likely play a key role in your business's future. We will explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns for optimizing price and performance.

Matthew is a Senior Program Manager in Microsoft’s Cloud + Enterprise group, with a focus on Enterprise Information Management, crowdsourced metadata, and data source discovery. Matthew currently delivers capabilities in Azure Data Catalog and has previously worked on Power BI, SQL Server Integration Services, Master Data Services, and Data Quality Services. When not enabling the world to get more value from its data, he enjoys reading, baking, and competitive longsword combat.

Presentations

Building a Rosetta Stone for business data Session

The data-driven business must bridge the language gap between data scientists and business users. In this talk, Matthew Roche and Jennifer Stevens walk through how to build a business glossary that codifies your semantic layer and enables greater conversational fluency between business users and data scientists.

Matthew Rocklin is an open source software developer focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today leads development of Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.

Presentations

Dask: Flexible Parallelism in Python for Advanced Analytics Session

This session offers an overview of Dask, a distributed system for advanced analytics in Python that natively extends popular data science libraries like NumPy, Pandas, and Scikit-Learn to run at scale, along with a discussion of computational task scheduling and of parallel computing within Python generally.

Scaling Python Data Analysis Tutorial

The Python data science stack (NumPy, Pandas, Scikit-Learn) is efficient and intuitive but only works on in-memory data and a single core. This tutorial teaches you to parallelize and scale your Python workloads to multicore machines and multi-machine clusters. We use a variety of tools; this comparative approach encourages us to think broadly about parallel tools and programming paradigms.
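
As a minimal sketch of what this scaling can look like with one such tool, Dask, consider the following; the file pattern and column names are illustrative.

    # Pandas-style analysis over a directory of CSVs, run in parallel by Dask.
    import dask.dataframe as dd

    df = dd.read_csv("events-*.csv")             # lazy, partitioned dataframe
    means = df.groupby("user_id").amount.mean()  # familiar Pandas-like API
    print(means.compute())                       # triggers parallel execution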

Julie Rodriguez is associate creative director at Sapient Global Markets. An experience designer focusing on user research, analysis, and design for complex systems, she has patented her work in data visualizations for MATLAB, compiled a data visualization pattern library, and publishes industry articles on user experience and data analysis and visualization. She is the coauthor of Visualizing Financial Data, a book about visualization techniques and design principles that includes over 250 visuals depicting quantitative data.

Presentations

Data visualizations decoded Data 101

Designing data visualizations presents unique and interesting challenges: how to tell a compelling story, how to deliver important information in a forthright, clear format, and how to make visualizations beautiful and engaging. Julie Rodriguez shares a few disruptive designs and connects them back to Vizipedia, her compiled data visualization library.

Expanding Data Literacy with Data Visualizations Session

While the value of data and its role in informing decisions and communications is well known, the true meaning is often lost or misinterpreted without appropriate data visualizations that provide context and accurately represent the underlying numbers. In this talk, you will learn new data visualizations to use in your own analyses and in presentations to others.

Steve Ross focuses on product management for security across the Hadoop ecosystem, championing the interests of users and IT teams working to get the most out of Hadoop while meeting information security and compliance requirements. Before joining Cloudera, Steve managed product portfolios at RSA Security and Voltage Security that are now in use by the largest global companies and over a hundred million users.

Presentations

GDPR: Getting Your Data Ready for Heavy New EU Privacy Regulations Session

The General Data Protection Regulation (GDPR) will go into effect in May 2018 for firms doing any business in the EU. However, many companies aren't prepared for its strict regulations or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session explores the capabilities your data environment needs in order to simplify compliance with GDPR and future regulations.

ESG Senior Analyst Nik Rouda covers big data, analytics, machine learning, and artificial intelligence. With 20+ years of experience in IT around the world, he understands the challenges of both vendors and buyers, and how to find the true value of innovative technologies. Using the knowledge he gathered previously helping to accelerate growth for fast-paced startups and Fortune 100 enterprises, Nik’s goal is to advise businesses on how to design their analytics strategy for maximum gain.

Presentations

Executive Briefing: AI and machine learning with Nik Rouda Session

This Executive Briefing is a part of the Strata Business Summit. Details to come.

Edgar Ruiz is a solutions engineer at RStudio with a background in deploying enterprise reporting and business intelligence solutions. He is the author of multiple articles and blog posts sharing analytics insights and server infrastructure for data science. Recently, Edgar authored the “Data Science on Spark using sparklyr” cheat sheet.

Presentations

Using R and Spark to Analyze Data on Amazon S3 Session

With R and sparklyr, a Spark Standalone cluster can be used to analyze large datasets found in S3 buckets. Edgar Ruiz presents and demos recommendations for the following: setting up a Spark Standalone cluster using EC2; S3 bucket folder and file setup; connecting R to Spark; the settings needed to read S3 data into Spark; and a data import and wrangling approach.

Philip Russom is a well-known figure in data warehousing, business intelligence, data management, and analytics, having published 550+ research reports, magazine articles, opinion columns, speeches, and webinars. Today, he's the research director for data management at TDWI, where, as an industry analyst, he oversees many of the company's research-oriented publications, services, and events.

Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors.

You can reach him at prussom@tdwi.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.

Presentations

The Data Lake: Improving the Role of Hadoop in Data-Driven Business Management Session

About the time Hadoop turned 10, long-time users realized they lacked viable methods for managing it. Mature IT groups continue to be appalled by the governance-free data dumping and lack of audit trails common with Hadoop, and business users are frustrated by the low value and trust they get from Hadoop data. Now they're all turning to the data lake, which promises to improve Hadoop usage and value.

Neelesh Srinivas Salian is a software engineer on the Data Platform Infrastructure team in the Algorithms group at Stitch Fix, where he works closely with the Apache Spark ecosystem. Previously, he worked at Cloudera on Apache projects like YARN, Spark, and Kafka. He holds a master's degree in computer science from North Carolina State University, with a focus on cloud computing, and a bachelor's degree in computer engineering from the University of Mumbai, India.

Presentations

Apache Spark in the hands of Data Scientists Session

This talk explains the development and workings of the data platform used by data scientists at Stitch Fix, which builds on the usability of the Spark ecosystem. We describe how we built the platform for data scientists, its main users, and what we learned in doing so.

Majken Sander is a data nerd, business analyst, and solution architect at TimeXtender. Majken has worked with IT, management information, analytics, BI, and DW for 20+ years. Armed with strong analytical expertise, she is keen on "data driven" as a business principle, data science, the IoT, and all other things data.

Presentations

Show me my data and I’ll tell you who I am Session

More and more personal data is spread across various services globally. But what do companies know about me? And how do we enable collecting that knowledge, getting hold of our own data, and maybe even correcting faulty perceptions by putting the right answers out there as a service? I desperately need to build myself a personal Discovery Hub: the go-to place for knowledge about me.

Sri is co-founder and CEO of H2O (@h2oai), the builders of H2O. H2O democratizes big data science and makes Hadoop do math for better predictions. Before H2O, Sri spent time scaling R over big data with researchers at Purdue and Stanford. Prior to that, Sri co-founded Platfora and was the director of engineering at DataStax. Before that, Sri was a partner and performance engineer at the Java multicore startup Azul Systems, tinkering with the entire ecosystem of enterprise apps at scale.

Even before that, Sri was on sabbatical pursuing theoretical neuroscience at Berkeley. And prior to that, Sri worked on a NoSQL trie-based index for semistructured data at the in-memory index startup RightOrder.

Sri is known for his knack for envisioning killer apps in fast-evolving spaces and assembling stellar teams to productize that vision. A regular speaker on the big data, NoSQL, and Java circuits, Sri leaves a trail @srisatish.

Presentations

Interpretable AI is not just for regulators! Session

Interpreting machine learning models is not just another regulatory burden to be overcome. Practitioners, researchers, and consumers who use these technologies in their work and their day-to-day lives have the right to trust and understand AI. This talk is an overview of techniques for interpreting machine learning models and telling stories from their results.

Andrei Savu is a software engineer at Cloudera, where he’s working on making data processing at scale easy in the cloud.

Presentations

A Deep Dive into Running Data Engineering Workloads in AWS Tutorial

Data engineering workloads are foundational workloads run prior to most analytic and operational database use cases. This hands-on tutorial will provide a deep dive into running data engineering workloads in a managed service capacity in the public cloud; highlight AWS infrastructure best practices; and discuss how data engineering workloads interoperate with data analytic workloads.

Jared Schiffman has worked at the intersection of design, computer science, and education for over two decades. Jared's work fuses the physical world with the digital world and plays with the relationship between the two. His projects are steeped in metaphor and gesture and emphasize the power of direct experience.

Jared is the founder of Perch Interactive Inc, a startup intent on revolutionizing the retail environment. PERCH is a digital display platform that attracts customers and motivates them to touch, pick up and discover the products on display. As shoppers engage, PERCH reveals dynamic digital content, directly beside the product. PERCH is the first technology to combine interaction with physical products, display of digital media and user analytics in one retailer-friendly package.

In 2005, Jared co-founded Potion, an interactive design and technology firm located in New York City. Potion has completed major projects for the Smithsonian, the National Holocaust Museum, HP, Bell Labs and the New York Public Library. Potion has been invited to the White House twice as finalists for the National Design Award in the Interaction Design category in 2009 and 2010. Potion was also named one of the “Top 10 Most Innovative Design Companies” in 2010 by Fast Company magazine.

Jared has taught courses at Parsons (The New School) and New York University and, prior to founding Potion, at the Gates-funded High Tech High in San Diego. Jared received a master's degree in Media Arts & Sciences from the MIT Media Lab, where he studied with Prof. John Maeda in the Aesthetics & Computation Group, and also holds an SB in Computer Science & Engineering from MIT.

Presentations

Retail's Panacea: How Machine Learning is Driving Product Development Session

Retailers are finally leveraging data and its insights across all parts of the supply chain to drive a profound transformation in the retail industry. From crafting assortments based on consumer demand signals, to reimagining consumer interactions with those products, to receiving and processing customer feedback, modern-day retailers have reimagined the product development process.

Bill Schmarzo is responsible for setting the strategy and defining the service line offerings and capabilities for the EMC Consulting Enterprise Information Management and Analytics service line. Bill has more than two decades of experience in data warehousing, BI, and analytic applications. Bill has served on the Data Warehouse Institute’s faculty as the head of the analytic applications curriculum.

Previously, Bill was the vice president of analytics at Yahoo, where he was responsible for the development of Yahoo’s advertiser and website analytics products, including the delivery of actionable insights through a holistic user experience. Before that, Bill oversaw the analytic applications business unit at Business Objects, which included the development, marketing, and sales of their industry-leading analytic applications.

Bill has written several white papers and is a frequent speaker on the use of big data and advanced analytics to power an organization’s key business initiatives. He authored the Business Benefits Analysis methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements and coauthored with Ralph Kimball a series of articles on analytic applications. Bill holds an MBA from the University of Iowa and a BS in mathematics, computer science, and business administration from Coe College.

Presentations

Determining the economic value of your data (EvD) Session

Organizations need a process and supporting frameworks to become more effective at leveraging data and analytics to transform their business models. This workshop uses the Big Data Business Model Maturity Index as a guide and provides worksheets for assessing the business value and implementation feasibility of an organization's business use cases with respect to their monetization potential.

I am a third-year CSE Ph.D. student and IGERT Big Data Fellow at the University of Washington. In addition, I am a core developer of the popular Python machine learning package `sklearn` and the author of the probabilistic modeling Python package `pomegranate`.

Presentations

pomegranate: flexible probabilistic modeling for python HDS

In this talk, I will give a high-level overview of `pomegranate`, a flexible probabilistic modeling package implemented in Cython for speed. I will highlight the models it supports, such as Bayesian networks and hidden Markov models, show how to implement these models easily, and show how the underlying modular implementation unlocks several benefits for the modern data scientist.
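
For a flavor of the API, here is a minimal sketch assuming pomegranate's pre-1.0 interface; the toy sequences and component count are illustrative.

    # Fit a two-state Gaussian hidden Markov model to toy sequences,
    # then score a sequence under the learned model.
    from pomegranate import HiddenMarkovModel, NormalDistribution

    sequences = [[0.1, 0.2, 5.1, 5.3], [0.0, 4.9, 5.2, 0.3]]
    model = HiddenMarkovModel.from_samples(
        NormalDistribution, n_components=2, X=sequences)
    print(model.log_probability(sequences[0]))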

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.
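
For a sense of the high-level API the course works with, here is a minimal TFLearn sketch; the layer sizes and data shapes are illustrative, not the course's actual exercises.

    # A small feed-forward classifier defined with TFLearn on top of TensorFlow.
    import tflearn

    net = tflearn.input_data(shape=[None, 784])              # e.g., flattened images
    net = tflearn.fully_connected(net, 128, activation='relu')
    net = tflearn.fully_connected(net, 10, activation='softmax')
    net = tflearn.regression(net, optimizer='adam',
                             loss='categorical_crossentropy')

    model = tflearn.DNN(net)
    # model.fit(X, Y, n_epoch=10)  # X, Y: your labeled training data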

Machine learning with TensorFlow (Day 2) Training Day 2

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies, Inc. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for six years.

Presentations

Cloudy with a chance of on-prem Data 101

The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company or for your use cases is critical to ensuring success. Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations.

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly Media.

Presentations

Architecting a next generation data platform Tutorial

Using the Internet of Things and Customer 360 as examples, we'll explain how to architect a modern, real-time big data platform leveraging recent advancements in open source software. We'll show how components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL, along with Apache Hadoop, can enable new forms of data processing and analytics.

Dor Sela joined Kahn Lucas in February 2015 with the mission of bringing the voice of the customer to the forefront of the company’s focus. He is a seasoned marketer with an unrelenting passion for consumer insight and understanding.

Dor is the former head of global consumer marketing operations for MasterCard. He started his professional career by founding a company during his B.A. studies at Tel Aviv University; he later sold the company and joined P&G in the brand management function, where he worked for almost a decade in various global marketing and managerial positions across three continents. In 2004, he moved to Reckitt Benckiser to turn around the Near East business as a general manager. In 2010, Dor moved to Paris, France, to become a member of the executive committee of Carrefour, the world's second-largest retailer, where he managed the multibillion-euro global own-label business and helped transition the company from a purely buying-focused operation to a consumer goods one.

Presentations

Retail's Panacea: How Machine Learning is Driving Product Development Session

Retailers are finally leveraging data and its insights across all parts of the supply chain to drive a profound transformation in the retail industry. From crafting assortments based on consumer demand signals, to reimagining consumer interactions with those products, to receiving and processing customer feedback, modern-day retailers have reimagined the product development process.

Nick Selby is a Texas police detective who investigates computer fraud and child exploitation. He is also a cybersecurity incident responder. A frequent contributor to newspapers including the Washington Post and the New York Times, he is co-author of Cyber Survival Manual: From Identity Theft to The Digital Apocalypse and Everything in Between; In Context: Understanding Police Killings of Unarmed Civilians; and Blackhatonomics: Understanding the Economics of Cybercrime; and technical editor of Investigating Internet Crimes: An Introduction to Solving Crimes in Cyberspace.

Presentations

The Context of Contacts: Seeking Root Causes of Racial Disparity in Texas Traffic-Summons Fines DCS

The Black Lives Matter movement and the recent focus in the media on how police interact with citizens of different races has helped engage Americans in an important conversation about how law enforcement works. Study after study has found that black drivers are more likely to be stopped and arrested than whites.

Viral Shah is the co-founder and CEO of Julia Computing and a co-creator of the Julia language and other open source software. Viral is also the co-author of Rebooting India and drove the re-architecting of the Indian government's social security systems as part of Aadhaar, the national ID project.

Presentations

Julia and Spark, Better Together Session

Spark is a fast and general engine for large-scale data. Julia is a fast and general engine for large-scale compute. By combining Julia's compute and Spark's data processing capabilities, amazing things are possible.

Tushar is head of Data Strategy and Data Products at LinkedIn. A seasoned executive with a track record of building high-growth businesses at market-defining companies such as LinkedIn, Cloudera, VMware, and Microsoft, he was most recently VP of Products & Design at Arimo, an Andreessen-Horowitz company building data intelligence products using analytics and AI.

Presentations

Taming the ever-evolving Compliance Beast: Lessons learned at LinkedIn Session

We describe the journey of the big data ecosystem at LinkedIn in preserving member privacy while providing data democracy. We discuss three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement platform, and a unified data access layer.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Architecting a next generation data platform Tutorial

Using the Internet of Things and Customer 360 as examples, we'll explain how to architect a modern, real-time big data platform leveraging recent advancements in open source software. We'll show how components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL, along with Apache Hadoop, can enable new forms of data processing and analytics.

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Unravelling data at scale with Spark using deep learning and other algorithms from machine learning. Tutorial

We walk you through approaches using the machine-learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Along with discussing common problems encountered as data and model sizes scale, we will leverage a few open source deep learning frameworks to run classification problems on image and text datasets with Spark.

Dave Shuman is the former Chief Operations Officer for Vision Chain, a leading Demand Signal Repository provider enabling Retailer and Manufacturer collaboration. Prior to that he served as Vice President of Field Operations responsible for customer success and user adoption, and Vice President of Product responsible for product strategy and messaging. Dave started at Vision Chain as Director of Services, and was responsible for implementations at such top CG companies as Kraft Foods, PepsiCo, and General Mills.

Previously, Dave worked for enews Inc., an ecommerce company acquired by Barnes and Noble, where he began as Vice President of Operations. He served as Executive Vice President of Management Information Systems, managing software development, operations and retail analytics. Dave developed ecommerce applications and business processes used by barnesandnoble.com, Yahoo! and Excite, and pioneered an innovative process for affiliate commerce.

Dave has an extensive background in business intelligence applications, database architecture, logical and physical database design and data warehousing. He holds an M.B.A. with a concentration in Information Systems from Temple University and a B.A. from Earlham College.

Presentations

An Open Source Architecture for IoT Session

Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. In this session we'll showcase an end-to-end architecture for IoT based on open source standards highlighting Eclipse Kura, which is an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.

Vartika Singh is a solutions consultant at Cloudera. Previously, Vartika was a data scientist applying machine-learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 10 years of experience designing and developing solutions and frameworks utilizing machine-learning techniques.

Presentations

Securely Building Deep Learning Models for Digital Health Data Tutorial

In this hands-on tutorial, we will teach attendees how to interactively develop and train deep neural networks to analyze digital health data using the Cloudera Workbench and DeepLearning4J (DL4J). Attendees will learn how to use the Workbench to rapidly explore real world clinical data, build data preparation pipelines, and launch training of neural networks.

Vartika Singh is a Data Science Architect at Cloudera with over 12 years of experience applying machine-learning techniques to big data problems.

Presentations

Unravelling data at scale with Spark using deep learning and other algorithms from machine learning. Tutorial

We walk you through approaches using the machine-learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Along with discussing common problems encountered as data and model sizes scale, we will leverage a few open source deep learning frameworks to run classification problems on image and text datasets with Spark.

José is a Software Engineer at Cloudera focused on Spark development. His core focus is on the internals of Spark as they matter to customers: reliability, correctness, and performance. He holds Master’s and Bachelor’s degrees from MIT.

Presentations

Fault Tolerance in Spark: Expectations versus Reality Session

Spark is supposed to be fault tolerant, right? It is, for the most part, but sometimes things can go wrong. This talk explores how fault tolerance works, when it doesn’t, and what you can do to debug and harden your production jobs. We’ll draw from lessons learned supporting Cloudera customers and making contributions to the Apache Spark project.

Audrey Spencer-Alvarado is a Business Analyst for the Portland Trail Blazers. Audrey and the other members of the business analytics team provide all data insights to the various decision makers at the Trail Blazers and affiliates. She leads the Tableau Reporting and statistical modeling projects as well.

Presentations

How the Portland Trail Blazers are using personalization techniques and Acxiom data to better target customers in marketing campaigns Session

Generally low conversion rates in marketing campaigns present a huge potential for improved efficiency within marketing teams. This session highlights the techniques used by the Portland Trail Blazers to better identify customers for targeted marketing campaigns, which led to a significant increase in their overall conversion rate and revenue generated per lead.

Jennifer is a Principal Program Manager with Microsoft Azure, overseeing Microsoft's approach to metadata management. A constant learner, Jennifer has put that into practice throughout her career, taking on new disciplines including product management, product marketing, engineering, and even a stint speechwriting for Microsoft's top executives.

Presentations

Building a Rosetta Stone for business data Session

The data-driven business must bridge the language gap between data scientists and business users. In this talk, Matthew Roche and Jennifer Stevens walk through how to build a business glossary that codifies your semantic layer and enables greater conversational fluency between business users and data scientists.

Bargava Subramanian is a senior data scientist at Red Hat, Bangalore India. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

AI-driven Next Generation Developer Tools Session

This talk describes how machine learning and deep learning techniques are helping Red Hat build smart developer tools that make software developers more efficient.

Sahaana Suri is a second year PhD student in the Stanford InfoLab, working with Peter Bailis. Sahaana’s research focuses on building easy-to-use, accessible data analytics and machine learning systems that scale. She holds a bachelor’s degree in Electrical Engineering and Computer Science from the University of California, Berkeley.

Presentations

MacroBase: A Search Engine for Fast Data Streams Session

MacroBase is a new analytics engine from Stanford designed to prioritize the scarcest resource in large-scale, fast-moving data streams: human attention. By combining streaming classification and explanation operators, MacroBase allows reconfigurable, real-time root-cause analyses that have already diagnosed issues in production streams in mobile, data center, and industrial applications.

David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science along with master’s degrees in both computer science and business administration.

Presentations

Natural language understanding at scale with spaCy, Spark ML & TensorFlow Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, TensorFlow for training custom machine-learned annotators, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
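
As a taste of the annotation-pipeline side, here is a minimal spaCy sketch; the model name and sample text are illustrative, and the tutorial's actual pipelines go well beyond this.

    # Load a small English model and extract named entities from one sentence.
    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes this model is installed
    doc = nlp("Acme Corp. hired Jane Doe in New York in 2017.")

    for ent in doc.ents:
        print(ent.text, ent.label_)      # entity spans with their predicted types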

When models go rogue: Hard-earned lessons about using machine learning in production Session

Machine learning and data science systems often fail in production in unexpected ways. This talk shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Sean Taylor is the Manager for the Bioinformatics & High Throughput Analytics team at Seattle Children’s Research Institute (SCRI). In this role, Sean manages the support delivery effort for bioinformatics and computational biology solutions for the eight research centers and almost 1,000 researchers at SCRI. Sean led design and development efforts for SCRI’s integrated precision medicine repository, and is now expanding the open source approaches and big data technologies to additional centers and cores. Prior to this role, Sean led the initiative to develop and implement a state-of-the-art bioinformatics core resource at SCRI. Prior to SCRI, Sean was a computational biologist at Amgen, customizing and driving usability in a range of end user interfaces and visualization tools while applying analytic code from multiple projects for areas such as immunotherapy and inflammation. Prior to Amgen, Sean was a post-doc at Fred Hutchinson Cancer Research Center where he developed a new ultrasensitive assay to detect rare mitochondrial DNA mutations in cancer and aging. Sean received a Ph.D. from Yale University after completing a B.S. at Brigham Young University.

Presentations

Project Rainier: Saving Lives One Insight at a Time Session

Leveraging the power of the Hadoop distributed file system and the Hadoop and Spark ecosystems, scientists at Seattle Children's Research Institute are able to quickly find new patterns and generate predictions that they can test later. The ultimate goal of Project Rainier is to accelerate important pediatric research and to increase scientific collaboration by highlighting where it is needed.

Richard Tibbetts is CEO of Empirical Systems, an MIT spinout building an AI-based data platform for organizations that use structured data to provide decision support. Previously, he was founder and CTO at StreamBase, a CEP company that merged with TIBCO in 2013, as well as visiting scientist at the Probabilistic Computing Project at MIT.

Presentations

AI for Business Analytics Session

Businesses have spent decades trying to make better decisions by collecting & analyzing structured data. New AI technologies are beginning to transform this process. This talk will focus on AI that (i) guides business analysts to ask statistically sensible questions and (ii) lets junior data scientists answer questions in minutes that previously took hours for trained statisticians.

Steven Totman is Cloudera's big data subject-matter expert, helping companies monetize their big data assets using Cloudera's Enterprise Data Hub. Steve works with over 180 customers worldwide, across verticals, on architectures, data management tools, data models, and ethical data usage. Previously, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining with the Ascential acquisition. He architected IBM's InfoSphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents in data integration and governance- and metadata-related designs. Although he is based in NYC, Steve is happiest onsite with customers wherever they may be in the world.

Presentations

Big Data and The Cloud Down Under - Exec Panel Session

Senior execs from a variety of major companies in Australia and New Zealand, including Air New Zealand, Westpac, ANZ, and BNZ, that have been pioneering the adoption of big data technologies like Hadoop share use cases, challenges, and how to be successful Down Under, on the opposite side of the world from where technologies like Hadoop got started.

Griffin – Fast-tracking model development in Hadoop Session

Griffin is a high-level, easy-to-use framework built on top of Spark that encapsulates the complexities of common model development tasks within four phases: data understanding, feature extraction, model development, and serving modeling results.

DB Tsai is an Apache Spark committer and a senior research engineer working on personalized recommendation algorithms at Netflix. He implemented several algorithms in Apache Spark, including linear regression and binary/multinomial logistic regression with elastic net (L1/L2) regularization using the LBFGS and OWL-QN optimizers. Prior to joining Netflix, DB was a lead machine learning engineer at Alpine Data Labs, where he led a team developing innovative large-scale distributed learning algorithms that were contributed back to the open source Apache Spark project. DB was a Ph.D. candidate in applied physics at Stanford University and holds a master's degree in electrical engineering from Stanford University.

Presentations

Boosting Spark MLlib Performance with Rich Optimization Algorithms Session

Recent developments in Spark MLlib have given users the power to express a wider class of ML models and decrease model training times via the use of custom parameter optimization algorithms. Seth Hendrickson and DB Tsai discuss when and how to use this new API and walk you through creating your own Spark ML optimizer. They also show performance benefits and real-world use cases.

Madeleine Udell is an Assistant Professor of Operations Research and Information Engineering and a Richard and Sybil Smith Sesquicentennial Fellow at Cornell University. She studies optimization and machine learning for large-scale data analysis and control, with applications in marketing, demographic modeling, medical informatics, and engineering system design. Her recent work on generalized low rank models (GLRMs) extends principal components analysis (PCA) to embed tabular data sets with heterogeneous (numerical, Boolean, categorical, and ordinal) types into a low dimensional space, providing a coherent framework for compressing, denoising, and imputing missing entries. She has developed a number of open source libraries for modeling and solving optimization problems, including Convex.jl, one of the top ten tools in the new Julia language for technical computing, and is a member of the JuliaOpt organization, which curates high-quality optimization software.

Madeleine completed her PhD at Stanford University in Computational & Mathematical Engineering in 2015 under the supervision of Stephen Boyd, and a one year postdoctoral fellowship at Caltech in the Center for the Mathematics of Information hosted by Professor Joel Tropp. At Stanford, she was awarded a NSF Graduate Fellowship, a Gabilan Graduate Fellowship, and a Gerald J. Lieberman Fellowship, and was selected as the doctoral student member of Stanford’s School of Engineering Future Committee to develop a road-map for the future of engineering at Stanford over the next 10–20 years. She received a B.S. degree in Mathematics and Physics, summa cum laude, with honors in mathematics and in physics, from Yale University.

Michelle Ufford leads centralized solutions for Data Engineering & Analytics (DEA) at Netflix. She’s currently focused on data intelligence tooling to make it easier to develop, deploy, and manage complex datasets. Previously, she led the Data Management team at GoDaddy, where she built data engineering solutions to support the company’s innovative advertising strategies and helped pioneer Hadoop data warehousing techniques.

Michelle is also a published author, patented developer, award-winning open source contributor, and Most Valuable Professional (MVP) for the Microsoft Data Platform. You can find her on Twitter (@sqlfool) or connect with her on LinkedIn (mufford).

Presentations

Working Smarter, Not Harder: Driving Data Engineering Efficiency @ Netflix Session

What if we used the wealth of data and experience at our disposal to drive improvements in data engineering? Can we find common patterns amongst the chaos, enabling us to automate repetitive or time-consuming tasks? Can we find ways to improve data quality? Reduce costs? Quickly identify & respond to issues? In this talk, Michelle Ufford will share how Netflix is tackling these very questions.

Amy Unruh is a developer programs engineer for the Google Cloud Platform, with a focus on machine learning and data analytics as well as other Cloud Platform technologies. Amy has an academic background in CS/AI and has also worked at several startups, done industrial R&D, and published a book on App Engine.

Presentations

Getting started with TensorFlow Tutorial

We will walk you through training and deploying a machine-learning system using TensorFlow, a popular open source library. Starting from conceptual overviews, we will build all the way up to complex classifiers. You’ll gain insight into deep learning and how it can apply to complex problems in science and industry.
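
For orientation, here is a minimal sketch in the TensorFlow 1.x graph style current at the time; the toy data and learning rate are illustrative, and the tutorial's own examples may differ.

    # Fit y = W*x + b to toy data with gradient descent in TensorFlow 1.x.
    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None])
    y = tf.placeholder(tf.float32, shape=[None])
    W = tf.Variable(0.0)
    b = tf.Variable(0.0)

    loss = tf.reduce_mean(tf.square(W * x + b - y))
    train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(200):
            sess.run(train, {x: [1., 2., 3., 4.], y: [2., 4., 6., 8.]})
        print(sess.run([W, b]))  # W should approach 2.0 and b approach 0.0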

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Vinithra was previously a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

A Deep Dive into Running Data Engineering Workloads in AWS Tutorial

Data engineering workloads are foundational workloads run prior to most analytic and operational database use cases. This hands-on tutorial will provide a deep dive into running data engineering workloads in a managed service capacity in the public cloud; highlight AWS infrastructure best practices; and discuss how data engineering workloads interoperate with data analytic workloads.

Ashish Verma is a managing director at Deloitte, where he leads the Big Data and IoT Analytics practice, building offerings and accelerators to enhance business processes and effectiveness. Ashish has more than 18 years of management consulting experience helping Fortune 100 companies build solutions that focus on addressing complex business problems related to realizing the value of information assets within an enterprise.

Presentations

Executive Briefing: From data insights to action—Developing a data-driven company culture Session

Ashish Verma explores the challenges organizations face after investing in hardware and software to power their analytics projects and the missteps that lead to inadequate data practices. Ashish explains how to course-correct and implement an insight-driven organization (IDO) framework that enables you to derive tangible value from your data faster.

Dean Wampler, Ph.D., is the VP of Fast Data Engineering at Lightbend, leading the development of Lightbend Fast Data Platform, a scalable, distributed stream data processing stack using Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly Media. He is a contributor to several open source projects and the co-organizer of several conferences around the world and several user groups in Chicago. Dean can be found on Twitter as @deanwampler.

Presentations

Stream All the Things!! Session

While stream processing is now popular, streaming architectures must be highly reliable and scalable as never before, more like microservice architectures. Using specific use cases, I'll define the requirements for streaming systems and show how they are met by popular tools like Kafka, Spark, Flink, and Akka. I'll argue that streaming and microservice architectures are actually converging.

Emmie leads the data team at Air New Zealand. She has established a data management framework and formed a data management solutions team delivering data governance, information architecture, shared data solutions, and a data quality framework to enable data-driven decision making within the organisation and deliver data platforms with quality shared data as an enabler for all digital platforms. Among other things, she has:
• Driven awareness of the importance of the data asset by communicating a shared vision for customer and operational data and developing a data governance framework that supports compliance and mitigates the risk of adverse customer experiences.
• Established data principles and a strategy for a shared data culture, master data solutions, taxonomies, and data quality measures to lay the foundation for a data-driven organisation.
• Implemented a big data technology platform to deliver insights in the network scheduling business area, resulting in adjustments to flight schedules that increased revenue opportunities and supported contractual obligations.
• Formed a high-performance data management solutions team to deliver shared data solutions, consult to the business on data matters, and deliver standard data designs, master data taxonomies, and data quality measures and dashboards, enabling operational efficiencies for revenue support teams and more accurate data delivered to customers.

Presentations

Big Data and The Cloud Down Under - Exec Panel Session

Senior execs from a variety of major companies in Australia and New Zealand, including Air New Zealand, Westpac, ANZ, and BNZ, that have been pioneering the adoption of big data technologies like Hadoop share use cases, challenges, and how to be successful Down Under, on the opposite side of the world from where technologies like Hadoop got started.

Edd Wilder-James is a technology analyst, writer, and entrepreneur based in California. He’s helping transform businesses with data as VP of strategy for Silicon Valley Data Science. Formerly Edd Dumbill, Edd was the founding program chair for the O’Reilly Strata conferences and chaired the Open Source Convention for six years. He was also the founding editor of the peer-reviewed journal Big Data. A startup veteran, Edd was the founder and creator of the Expectnation conference-management system and a cofounder of the Pharmalicensing.com online intellectual-property exchange. An advocate and contributor to open source software, Edd has contributed to various projects such as Debian and GNOME and created the DOAP vocabulary for describing software projects. Edd has written four books, including O’Reilly’s Learning Rails.

Presentations

Developing a Modern Enterprise Data Strategy Session

Fundamentally, data should serve the strategic imperatives of a business—those key aspirations that will define an organization’s future vision. Conventional data strategy has little to guide us, focusing more on governance than on creating new value. In this tutorial, we explain how to create a modern data strategy that powers data-driven business.

The business case for AI, Spark, and friends Data 101

AI is white-hot at the moment, but where can it really be used? Developers are usually the first to understand why some technologies cause more excitement than others. Edd Wilder-James relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 and explaining why they're exciting in terms of both the new capabilities and the new economies they bring.

Rose Winterton leads the product direction for Pitney Bowes Software's Location Intelligence products and solutions, with a recent focus on the use of spatial processing in big data environments. She has 15 years of experience in location intelligence and has led both product management and services teams at Pitney Bowes.

Rose is passionate about how people use location intelligence and has a wide range of personal customer experience in EMEA and the US, covering the telecommunications, insurance, public sector, geosciences, and retail vertical markets. She worked hands-on developing customer solutions as a senior consultant before moving into management. Rose studied GIS and remote sensing at University College London and geology at Oxford University.

Presentations

Benefits of Big Data GeoEnrichment for better business outcomes DCS

Location intelligence has traditionally been used to join disparate datasets, analyze data in a spatial context, and operationalize business rules. The ability to run these processes in a big data environment means that the volume, variety, and velocity of datasets that can be utilized has increased. Geo-enrichment uses a location-based key to manage data and provide a single view of a location.

Ian Wrigley has taught tens of thousands of students over the last 25 years in subjects ranging from C programming to Hadoop development and administration. Ian is currently the director of education services at Confluent, where he heads the team building and delivering courses focused on Apache Kafka and its ecosystem.

Presentations

Building Real-Time Data Pipelines with Apache Kafka Tutorial

This hands-on workshop is designed for people interested in using Apache Kafka to build real-time streaming data pipelines. By the end of the tutorial, attendees will have seen how Kafka Connect and the Kafka Streams API can be used to ingest and process data in real time, as it is being generated. We assume no prior knowledge of Kafka. The tutorial includes hands-on exercises.
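
Kafka Connect and Kafka Streams are written and configured on the JVM, so as a language-neutral taste of the produce/consume model underneath them, here is a minimal sketch using the confluent-kafka Python client instead; the broker address, topic, and payload are illustrative.

    # Produce one record to a topic, then read it back.
    from confluent_kafka import Producer, Consumer

    p = Producer({'bootstrap.servers': 'localhost:9092'})
    p.produce('pageviews', key='user-1', value='{"page": "/home"}')
    p.flush()                                    # block until delivery

    c = Consumer({'bootstrap.servers': 'localhost:9092',
                  'group.id': 'demo',
                  'auto.offset.reset': 'earliest'})
    c.subscribe(['pageviews'])
    msg = c.poll(10.0)                           # wait up to 10s for a record
    if msg is not None and msg.error() is None:
        print(msg.key(), msg.value())
    c.close()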

Bichen Wu is a PhD student at UC Berkeley focusing on deep learning, computer vision & autonomous driving.

Presentations

Efficient Neural Networks for Perception for Autonomous Vehicles HDS

In this talk, we will focus on perception tasks for autonomous driving and discuss how we designed efficient neural networks to address them.

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Before joining Cloudera, Jennifer worked as a product line manager at VMware.

Presentations

A Deep Dive into Running Data Engineering Workloads in AWS Tutorial

Data engineering workloads are foundational workloads run prior to most analytic and operational database use cases. This hands-on tutorial will provide a deep dive into running data engineering workloads in a managed service capacity in the public cloud; highlight AWS infrastructure best practices; and discuss how data engineering workloads interoperate with data analytic workloads.

How to Successfully Run Data Pipelines in the Cloud Session

Data engineering is the foundational workload run prior to implementing most data analytic and operational database use cases. This talk explores the latest technologies that deliver data engineering as a service and presents a customer case study in which this technology is integrated into a real-world data analytics pipeline.

Microsoft

Presentations

Performance tuning your Hadoop/Spark clusters to use cloud storage Session

Remote storage in the cloud provides an infinitely scalable, cost-effective, and performant solution for big data customers. Adoption is rapid, given the flexibility and cost savings that come with unlimited storage capacity when compute and storage are separated.

Yan is an engineer on the Voldemort and Venice team within LinkedIn’s data infrastructure organization. He has extensive experience working on cluster management, ZooKeeper, Apache Helix, and distributed systems in general.

Presentations

Introducing Venice: a Derived Datastore for Batch, Streaming and Lambda Architectures Session

As companies build batch and stream processing pipelines, the next stage in their evolution is to serve the insights they gleaned back to their users. This often-overlooked step can be hard to accomplish reliably and at scale. Venice is a new datastore capable of ingesting data from Hadoop and Kafka, merging it together, replicating it globally, and serving it online at low latency.
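
To make the pattern concrete, here is a conceptual Python sketch of what a derived datastore does; this illustrates the idea only and is not Venice's actual API:

```python
# A conceptual sketch of a derived datastore: merge bulk-loaded batch
# data with incremental stream updates and serve the merged view by key.
class DerivedStore:
    def __init__(self):
        self._view = {}

    def bulk_load(self, batch_records):
        """Swap in a full dataset computed offline (e.g., from Hadoop)."""
        self._view = dict(batch_records)

    def apply_stream_update(self, key, value):
        """Overlay an incremental update (e.g., from Kafka)."""
        self._view[key] = value

    def get(self, key):
        """Low-latency online read of the merged view."""
        return self._view.get(key)

store = DerivedStore()
store.bulk_load([("user-1", {"score": 0.7}), ("user-2", {"score": 0.4})])
store.apply_stream_update("user-2", {"score": 0.9})
print(store.get("user-2"))  # reflects the latest stream update
```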

Fangjin Yang is a coauthor of the open source Druid project and a cofounder of Imply, a data analytics startup based in San Francisco. Previously, Fangjin held senior engineering positions at Metamarkets and Cisco Systems. Fangjin holds a BASc in electrical engineering and an MASc in computer engineering from the University of Waterloo, Canada.

Presentations

Analytics at Wikipedia Session

The Wikimedia Foundation is a nonprofit charitable organization and the parent organization of Wikipedia. As one of the most visited websites in the world, we face many unique challenges around better understanding our ecosystem of editors, readers, and content. In this session, we will discuss how we do analytics at the WMF and cover the technology we use for our data.

Yuhao Yang is a software engineer at Intel focused on providing implementation, consulting, and tuning advice on the Hadoop ecosystem to industry partners. Yuhao’s area of focus is distributed machine learning, especially large-scale analytical applications and infrastructure on Spark. He’s also an active contributor to Spark MLlib, where he delivered the implementations of online LDA and QR decomposition as well as several feature-engineering transformers and contributed improvements to a number of important algorithms.

Presentations

Building Advanced Analytics and Deep Learning on Apache Spark with BigDL Session

We’ll share our experience building end-to-end analytics and deep learning applications, including speech recognition and object detection, on top of BigDL and Spark. We’ll also introduce recent developments in BigDL, including Python APIs, notebook and TensorBoard support, TensorFlow model read/write support, better recurrent and recursive net support, and 3D image convolutions.

Mike Yoder is a software engineer at Cloudera who has worked on a variety of Hadoop security features and internal security initiatives. Most recently, he implemented log redaction and the encryption of sensitive configuration values in Cloudera Manager. Prior to Cloudera, he was a security architect at Vormetric.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Kelley Yohe has been a key Agile delivery manager building and enhancing data services and banking platforms for a number of startups and businesses that have successfully challenged norms in financial services. With data at the core of her work, Kelley has helped businesses build, enhance, govern, and innovate their data strategies with tools, processes, techniques, and platforms, including an array of open source and cloud-based technologies.

Presentations

Big Data and The Cloud Down Under - Exec Panel Session

Senior executives from major companies in Australia and New Zealand, including Air New Zealand, Westpac, ANZ, and BNZ, have been pioneering the adoption of big data technologies like Hadoop. In this panel, they share use cases and challenges and explain how to be successful Down Under, on the opposite side of the world from where technologies like Hadoop got started.

Lucy is a graduate of MIT, where she majored in computer science. For her master of engineering degree, she worked with Matei Zaharia, implementing an experimental framework for work sharing in Spark.

Presentations

Exploring Real-Time Capabilities with Spark SQL Session

This talk will explore how we could extend the Spark SQL abstraction to support more complex pushdowns, such as group bys, subqueries, and joins.
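
For context, Spark SQL's data source API pushes down only column pruning and simple filters to external systems; a common workaround today is to embed the aggregation in the query sent to the source. A minimal PySpark sketch of that workaround, with hypothetical connection details:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

# Spark's JDBC source pushes down projections and simple filters, but not
# aggregates. Embedding the aggregate in the dbtable option as a subquery
# makes the database execute it. (URL and credentials are hypothetical;
# requires the appropriate JDBC driver on the classpath.)
agg = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com/sales")
    .option("dbtable",
            "(SELECT region, SUM(amount) AS total "
            "FROM orders GROUP BY region) AS t")
    .option("user", "reader")
    .option("password", "...")
    .load())

agg.show()  # only the aggregated rows cross the network
```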

Matei Zaharia is an assistant professor of computer science at Stanford University, where he works on computer systems and big data.

Presentations

Weld: Accelerating Data Science by 100x Session

Modern data applications combine functions from many optimized libraries (e.g., Pandas and TensorFlow), and yet do not achieve peak hardware performance due to data movement across functions. Weld is a new interface to implement functions in these libraries while enabling optimizations across them. Weld can be integrated into libraries such as Pandas or Spark SQL with no changes to user code.
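
To make the data-movement problem concrete, here is a small NumPy illustration (not Weld itself) of how composing library calls materializes intermediate arrays, whereas a fused computation makes a single pass over the data:

```python
import numpy as np

x = np.random.rand(10_000_000)
y = np.random.rand(10_000_000)

# Library-composed version: x * y allocates and writes a full temporary
# array to memory before sum() reads it back, wasting memory bandwidth.
unfused = (x * y).sum()

# Fused version: a single pass over the data with no large intermediate.
# Weld aims to perform this kind of fusion automatically, across library
# boundaries, without changes to user code.
fused = np.dot(x, y)

assert np.isclose(unfused, fused)
```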

Ben is a data scientist and developer at Continuum Analytics. He has several years of experience with Python and is passionate about any and all forms of data. He currently spends his time thinking about the usability of large data systems and about infrastructure problems as they relate to data management and analysis.

Presentations

Scaling Python Data Analysis Tutorial

The Python data science stack (NumPy, pandas, scikit-learn) is efficient and intuitive, but only for in-memory data on a single core. This tutorial teaches you to parallelize and scale your Python workloads to multi-core machines and multi-machine clusters using a variety of tools; this comparative approach encourages us to think broadly about parallel tools and programming paradigms.
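
As a flavor of the material, here is a minimal sketch using Dask, one of the parallelism libraries developed at Continuum; the abstract does not name specific tools, so the choice of Dask and the file path here are assumptions:

```python
# Scale a pandas-style computation across cores (or a cluster) with Dask.
import dask.dataframe as dd

# Read a directory of CSVs as one logical dataframe ('data/*.csv' is a
# hypothetical path). Work is split into partitions and scheduled lazily.
df = dd.read_csv("data/*.csv")

# Familiar pandas-style operations build a task graph...
result = df.groupby("user_id").amount.mean()

# ...which executes in parallel only when compute() is called.
print(result.compute())
```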

Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.

Presentations

Data Science at Team Scale: Considerations for sharing, collaboration, and getting to production Session

Data science alone is easy. Data science with others, in the enterprise, on shared distributed systems, requires a bit more work. This talk will discuss common technology considerations and patterns for collaboration in large teams, as well as for moving machine learning into production at scale.

David Zhou is a senior director at Samsung, where he heads the big data platform and engineering team.

Presentations

How to Successfully Run Data Pipelines in the Cloud Session

Data engineering is the foundational workload run prior to implementing most data analytic and operational database use cases. This talk will explore the latest technologies that deliver data engineering as a service and present a customer case study in which this technology is integrated into a real-world data analytics pipeline.