Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Speakers

Hear from innovative data scientists, senior engineers, and leading executives who are doing amazing things with data. More speakers will be announced; please check back for updates.


Bill Chambers is a product manager at Databricks, where he works on Structured Streaming and data science products. He is the lead author of Spark: The Definitive Guide, coauthored with Matei Zaharia. Bill also created SparkTutorials.net as a way to teach Apache Spark basics. Bill holds a master’s degree in information management and systems from UC Berkeley’s School of Information. During his time at school, Bill also created the Data Analysis in Python with pandas course for Udemy and was cocreator of and first instructor for Python for Data Science, part of UC Berkeley’s master’s in data science program.

Presentations

Streaming big data in the cloud: What to consider and why Session

Streaming big data is a rapidly growing field, and one that currently involves significant operational complexity and expertise. This talk presents a decision-making framework to help attendees reason about the tools and technologies with which they can successfully deploy and maintain streaming data pipelines that solve business problems.

A software engineer on the cloud team at Cloudera.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.

I am an astrophysicist using data science techniques to study the Universe.

Presentations

Learning Machine Learning using Astronomy data sets Tutorial

We present an intermediate machine learning tutorial based on actual problems in astronomy research. Our strengths are that we use interesting, diverse, publicly available data sets; we incorporate students' feedback on our "best and worst" content; we focus on the customization of algorithms and evaluation metrics that scientific applications require; and we propose open problems to our participants.

Nishith works on the Hudi project and the broader Hadoop platform at Uber. His interests lie in large-scale distributed and data systems.

Presentations

Hudi: Unifying storage and serving for batch and near-real-time analytics Session

Uber has a real need to provide faster, fresher data to data consumers and products, running hundreds of thousands of analytical queries every day. Uber engineers share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Vijay Srinivas Agneeswaran is a senior director of technology at SapientRazorfish. Vijay has spent the last 10 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor’s degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.

Presentations

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.

Arpan is a software engineer on the Analytics Platforms and Applications team at LinkedIn. He holds a graduate degree in computer science and engineering from IIT Kanpur.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping Session

Have you ever tuned a Spark or MapReduce job? If so, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. With Dr. Elephant we introduced heuristic-based tuning recommendations. Now we introduce TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.

Adil Aijaz is CEO and cofounder of Split Software. Adil brings over ten years of engineering and technical experience, having worked as a software engineer and technical specialist at some of the most innovative enterprise companies, including LinkedIn, Yahoo!, and most recently RelateIQ (acquired by Salesforce). His tenure at these companies before founding Split in 2015 gave him deep experience in solving data-driven challenges and delivering data infrastructure. Adil holds a bachelor of science in computer science and engineering from UCLA and a master of engineering in computer science from Cornell University.

Presentations

The Lure of "One Metric That Matters" Session

Many products, whether data-driven or not, chase “the one metric that matters.” It may be engagement, revenue, or conversion, but the common theme is the pursuit of improvement in a single metric. Product development teams should instead focus on designing metrics that measure their goals. Adil presents an approach to designing metrics and discusses best practices and common pitfalls you may run into.

Amro is a data scientist with the National Health Insurance Company – Daman, a leading health insurance company headquartered in Abu Dhabi, UAE. His focus is on business-driven AI expert systems for health insurance. He holds an MSc in quantum computing from the Masdar Institute, in partnership with MIT, and received his BSc in computer systems engineering from Birzeit University in 2009.

Presentations

Real-time automated claim processing: the surprising utility of NLP methods on non-text data Findata

Processing claims is central to every insurance business. We present a successful business case for automating claims processing, from idea to production. The machine learning-based claim automation model uses NLP methods on non-text data and allows auditable automated claims decisions to be made.

Archana Anandakrishnan is a senior data scientist in the decision science organization at American Express. She works on developing data products, such as the one presented here, that accelerate the modeling lifecycle and the adoption of new methods at American Express. She is currently a lead developer of and contributor to DataQC Studio. Prior to joining American Express in 2015, she was a particle physics researcher working as a postdoc at Cornell University; she obtained her PhD in physics from the Ohio State University. She is also passionate about mentoring and is currently a workplace mentor with Big Brothers Big Sisters, NYC.

Presentations

Let the Machines Learn to Improve Data Quality Session

Building accurate machine learning models hinges on the quality of the data. Errors and anomalies get in the way of data scientists doing their best work. Learn how American Express created an automated, scalable system for the measurement and management of data quality, and how you can too. The methods described are modular and adaptable to any domain where accurate decisions from ML models are critical.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time systems with Spark Streaming and Kafka 1-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks and explains how to choose the right one for your company.

Alberto was born in Argentina. Over the last decade he worked for big companies such as Motorola, Intel, and Samsung before switching to consulting and specializing in machine learning.
He has written a lot of low-level code in C/C++ and was an early Scala enthusiast and developer.
Currently, he is part of the Spark NLP team at JohnSnowLabs, where he works as a data scientist implementing state-of-the-art NLP algorithms on top of Spark.
A lifelong learner, he holds a degree in engineering, another in computer science, and is working on a third, focused on AI.
He enjoys the outdoors, particularly hiking and camping in the mountains of Argentina.

Presentations

Spark NLP in Action: How SelectData Uses AI to Better Understand Home Health Patients Session

This case study describes a question answering system for accurately extracting facts from free-text patient records. The solution is based on Spark NLP, an open source extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding. We'll share best practices for training the domain-specific deep learning NLP models that such problems usually require.

Patrick is the Chief Architect for Financial Services at Cloudera.

Presentations

Too Big Data To Fail: How banks use big data to prevent the next financial crisis Findata

The financial crisis of 2008 exposed systemic issues in the financial system that resulted in the failures of several established institutions and a bailout of the entire industry. In the aftermath, banks and regulators are turning to big data solutions in order to avoid a repeat of history.

Mauricio Aristizabal is the data pipeline architect at Impact (formerly Impact Radius), a marketing technology company that helps brands grow by optimizing their paid marketing and media spend. At Impact, Mauricio has been responsible for massively scaling and modernizing the company's analytics capabilities: selecting datastores and processing platforms and designing many of the jobs that process internally and externally captured data and make it available to report and dashboard users, analytic applications, and machine learning jobs. He has also assisted the operations team with maintaining and tuning the company's Hadoop and Kafka clusters.

Presentations

Real-time analytics and BI with a data lake and data warehouse using Kudu, HBase, Spark, and Kafka: Lessons learned Session

Lessons learned from migrating Impact's traditional ETL platform to real-time processing on Hadoop (leveraging the full Cloudera EDH stack). A data lake in HBase, Spark Streaming jobs (with Spark SQL), Kudu for "fast data" BI queries, and a Kafka data bus for loose coupling between components are some of the topics we'll explore in detail.

Presentations

From Training to Serving: Deploying Tensorflow Models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join this tutorial to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Sudhanshu Arora is a software engineer at Cloudera, where he leads the development for data management and governance solutions. Previously, Sudhanshu was with the platform team at Informatica, where he helped design and implement its next-generation metadata repository.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.

TBD

Presentations

Building A Large-Scale Machine Learning Application Using Amazon SageMaker and Spark Tutorial

Outline: a quick product overview of Amazon SageMaker, AWS's newest ML platform; creating a Spark EMR cluster; integrating SageMaker algorithms into Spark pipelines; and ensembling multiple models for a real-time prediction task.

Ahsan Ashraf is a data scientist at Pinterest, focusing on recommendations and ranking for the Discovery team. Previously, Ahsan worked with personal finance startup wallet.ai as part of an Insight Data Science fellowship, where he designed and built a recommender system that drew insights into users' spending habits from their transaction history. Ahsan holds a PhD in condensed/soft matter physics.

Presentations

Diversification in recommender systems: Using topical variety to increase user satisfaction Session

Online recommender systems often rely heavily on user engagement features. This can bias them toward exploitation over exploration, over-optimizing on users' existing interests. Content diversification is important for user satisfaction, but measuring and evaluating its impact is challenging. This session outlines techniques used at Pinterest that drove ~2–3% impression gains and a ~1% gain in time spent.
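As a rough illustration of the kind of diversification technique the session describes, here is a minimal greedy re-ranker that discounts a candidate's relevance score by how often its topic has already been selected. This is a generic sketch, not Pinterest's production system; the items, topics, scores, and penalty weight are all invented.

```python
def diversify(items, k, penalty=0.5):
    """Pick k items greedily, discounting each candidate's relevance
    score by how many items with the same topic were already chosen."""
    chosen, topic_counts = [], {}
    pool = list(items)
    for _ in range(min(k, len(pool))):
        best = max(
            pool,
            key=lambda it: it["score"] - penalty * topic_counts.get(it["topic"], 0),
        )
        pool.remove(best)
        chosen.append(best)
        topic_counts[best["topic"]] = topic_counts.get(best["topic"], 0) + 1
    return chosen

# Hypothetical candidates: pure relevance ranking would return three
# "recipes" items in a row.
candidates = [
    {"id": 1, "topic": "recipes", "score": 0.95},
    {"id": 2, "topic": "recipes", "score": 0.90},
    {"id": 3, "topic": "recipes", "score": 0.88},
    {"id": 4, "topic": "travel",  "score": 0.70},
    {"id": 5, "topic": "decor",   "score": 0.60},
]
top3 = diversify(candidates, k=3)  # spans three distinct topics
```

With these made-up scores, the re-ranked top three covers three topics instead of being dominated by the highest-scoring "recipes" items.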

An accomplished speaker, Stacy Ashworth is a registered nurse with a master's degree in healthcare administration with an emphasis in informatics. Currently employed in a clinical intelligence role, she is professionally interested in the use of technology to improve the quality of care through better decision making.
She contributed to the healthcare informatics and technology track of the 2016 Business and Health Administration Association meeting, presenting research on the evaluation of glucose monitoring technologies for cost-effective and quality control/management of diabetes.
Post-acute care, geriatrics, and coding may be her passions, but her love is firmly centered on family, with two lively teenagers, a spouse, and a couple of schnauzers to keep things interesting.

Presentations

Spark NLP in Action: How SelectData Uses AI to Better Understand Home Health Patients Session

This case study describes a question answering system for accurately extracting facts from free-text patient records. The solution is based on Spark NLP, an open source extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding. We'll share best practices for training the domain-specific deep learning NLP models that such problems usually require.

Tony Baer leads Ovum’s research in Big Data, middleware, and the management of embedded software development in the product lifecycle. Tony has defined the architecture, use cases, and market outlook for Big Data and led the industry’s first global enterprise survey on Big Data adoption.

Tony has been a noted authority on data management, integration architecture, and software development platforms for nearly 20 years. Prior to joining Ovum, he was an independent analyst whose company, onStrategies, provided vendors of software development and integration tools with technology assessment and market positioning services.

He co-authored some of the earliest books on the Java and .NET frameworks including Understanding the .NET Framework and J2EE Technology in Practice.

His career began as a journalist with leading publications including Computerworld, Application Development Trends, Computergram, Software Magazine, Information Week, and Manufacturing Business Technology.

Presentations

Executive Briefing: Profit from AI and Machine Learning – The best practices for people & process Session

Ovum will present the results of research cosponsored by Dataiku, based on a survey of a specially selected sample of chief data officers and data scientists, on how to map roles and processes to make business success with AI repeatable.

Marton Balassi is a solutions architect at Cloudera, where he focuses on data science and stream processing with big data tools. Marton is a PMC member at Apache Flink and a regular contributor to open source. He is a frequent speaker at big data-related conferences and meetups, including Hadoop Summit, Spark Summit, and Apache Big Data.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Previously, Roger was in the Cloud Machine Learning Group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.

Presentations

Continuous machine learning over streaming data, the story continues. Session

Understand how unsupervised learning can provide insights into streaming data, with new applications that impute missing values, forecast future values, detect hotspots, and perform classification tasks, and how to implement these efficiently so they operate in real time over massive data streams.
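To make the idea concrete, here is a minimal sketch of unsupervised processing over a stream: Welford's online algorithm maintains a running mean and variance in constant memory, and values far from the running mean are flagged as anomalies. This is a generic illustration only, not the Kinesis algorithms discussed in the session; the class name, threshold, and sample stream are invented.

```python
import math

class StreamingAnomalyDetector:
    """Flag values more than `threshold` standard deviations from the
    running mean, updating statistics incrementally (Welford's algorithm)."""

    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations
        self.threshold = threshold

    def update(self, x):
        """Consume one value; return True if it looks anomalous."""
        anomalous = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                anomalous = True
        # O(1) incremental update: no need to store the stream itself.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

detector = StreamingAnomalyDetector(threshold=3.0)
flags = [detector.update(x) for x in [10, 11, 9, 10, 12, 11, 10, 50]]
# Only the final value (50) is flagged as anomalous.
```

Because the state is just three numbers per stream, the same pattern scales to massive numbers of parallel streams.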

Dylan Bargteil is a data scientist in residence at the Data Incubator, where he works on research-guided curriculum development and instruction. Previously, he worked with deep learning models to assist surgical robots and was a research and teaching assistant at the University of Maryland, where he developed a new introductory physics curriculum and pedagogy in partnership with HHMI. Dylan studied physics and math at University of Maryland and holds a PhD in physics from New York University.

Presentations

Machine Learning from Scratch in TensorFlow 1-Day Training

The TensorFlow library provides data flow graphs for numerical computation, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. This training introduces TensorFlow's capabilities through its Python interface.
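The data flow idea can be illustrated without TensorFlow itself: computations are declared as graph nodes, and evaluation resolves each node's inputs first, caching shared subgraphs so every node runs once. The toy classes below are invented for illustration and are not TensorFlow's API.

```python
class Node:
    """One operation in a data flow graph; edges carry values from inputs."""

    def __init__(self, op, *inputs):
        self.op = op          # callable combining the input values
        self.inputs = inputs  # upstream Node objects

    def eval(self, cache=None):
        """Evaluate this node, computing each upstream node only once."""
        cache = {} if cache is None else cache
        if self not in cache:
            args = [n.eval(cache) for n in self.inputs]
            cache[self] = self.op(*args)
        return cache[self]

def constant(value):
    return Node(lambda: value)

def add(a, b):
    return Node(lambda x, y: x + y, a, b)

def mul(a, b):
    return Node(lambda x, y: x * y, a, b)

# Declare the graph first, then run it: y = (2 + 3) * 4
x = add(constant(2), constant(3))
y = mul(x, constant(4))
result = y.eval()  # 20
```

Separating graph construction from execution is what lets a framework like TensorFlow schedule independent subgraphs in parallel across devices.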

Bonnie Barrilleaux is a staff data scientist in analytics at LinkedIn, who primarily focuses on communities and the content ecosystem. She uses data to guide product strategy, performs experiments to understand the ecosystem, and creates metrics to evaluate product performance. Previously, she completed a postdoctoral fellowship in genomics at the University of California, Davis, studying the function of the Myc gene in cancer and stem cells. She holds a PhD in Chemical Engineering from Tulane University; has published peer-reviewed works including 11 journal articles, a book chapter, and a video article; and has been awarded multiple grants to create interactive art.

Presentations

Perverse incentives in metrics: inequality in the like economy Session

Following metrics blindly leads to unintended negative side effects. At LinkedIn, as we encouraged members to join conversations, we found ourselves in danger of creating a "rich get richer" economy in which a few creators received an increasing share of all feedback. This example reminds us to regularly reevaluate metrics, because creating value for users is more important than driving any particular metric.

James Bednar is a senior solutions architect at Anaconda. Previously, Jim was a lecturer and researcher in computational neuroscience at the University of Edinburgh, Scotland, and a software and hardware engineer at National Instruments. He manages the open source Python projects datashader, HoloViews, GeoViews, ImaGen, and Param. He has published more than 50 papers and books about the visual system, data visualization, and software development. Jim holds a PhD in computer science from the University of Texas as well as degrees in electrical engineering and philosophy.

Presentations

Making interactive browser-based visualizations easy in Python Tutorial

Python lets you solve data-science problems by stitching together packages from the Python ecosystem, but it can be difficult to assemble the right tools to solve real-world problems. Here we show how to use the 15+ packages covered by the new PyViz.org initiative to make it simple to build interactive plots and dashboards, even for large, streaming, and highly multidimensional data.

William Benton leads a team of data scientists and engineers at Red Hat, where he has applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy intelligent applications in cloud-native environments, but he has also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.

Presentations

Why data scientists should love Linux containers Session

Containers are a hot technology for application developers, but they also provide key benefits for data scientists. In this talk, you'll learn about the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively.

Mike Berger is Vice President of Population Health Informatics & Data Science at Mount Sinai Health, which is delivering on the promise to transform into the premier population health management health system in the New York metro area. His role includes developing and implementing data-driven clinical and actuarial decision support through the use of advanced analytics, machine learning, and timely operational BI.

Mike was recently named a 2018 Top 50 Data and Analytics Professional in the US and Canada by Corinium Intelligence. He has over twenty years of experience across large academic medical centers and payer organizations as well as entrepreneurial startups and management consulting. Mike is the co-chair of the HIMSS Clinical & Business Intelligence community, hosting a webinar series for analytics thought leaders.

Originally from Huntington Beach, CA, Mike holds an industrial and systems engineering degree from USC and a healthcare project management certification from Harvard's Graduate School of Public Health, and he recently received his master's from NYU Stern.

Presentations

Decision-Centricity: Operationalizing Analytics and Data Science in Health Systems Data Case Studies

Hear how Mount Sinai Health has moved up the analytics maturity curve to deliver business value in new risk models around population health. Learn how to design a team, build a data factory, and generate the analytics to drive decision-centricity. See examples of mixing Tableau, SQL, Hive, APIs, Python, and R into a cohesive ecosystem supported by the data factory.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences in the United States and internationally. He is the copresenter of various O’Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Stream Processing with Kafka and KSQL Tutorial

A solid introduction to Apache Kafka as a streaming data platform. We'll cover its internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams—then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka.

Gina Bianchini is the founder and CEO of Mighty Networks and an expert on network effects. Mighty Networks is a rapidly growing SaaS platform for growing and monetizing your own network effects brands and businesses, replacing a blog or website with something much more powerful.

Before Mighty Networks, Gina and Netscape co-founder Marc Andreessen launched Ning, a pioneering global platform for creating niche social networks. Under her leadership, Ning grew to ~100 million people in 300,000 active social networks across subcultures, professional networks, entertainment, politics, and education.

In addition to Mighty Networks, Gina has served as a board director of Scripps Networks (NASDAQ: SNI), a $12 billion public company that owns HGTV, the Food Network, and the Travel Channel and recently merged with Discovery Communications, and of TEGNA (NYSE: TGNA), a $3 billion broadcast and digital media company. She also cofounded LeanIn.Org with Sheryl Sandberg, an organization dedicated to women leaning into their ambitions, where she launched Lean In Circles worldwide.

Gina and Mighty Networks have been featured in Fast Company, Wired, Vanity Fair, Bloomberg, and The New York Times. She has appeared on Charlie Rose, CNBC, and CNN. She grew up in Cupertino, California, graduated with honors from Stanford University, started her career in the nascent High Technology Group at Goldman, Sachs & Co., and received her M.B.A from Stanford Business School.

Presentations

Keynote with Gina Bianchini Keynote

Gina Bianchini, Founder & CEO of Mighty Networks

Anya Bida is a senior member of the technical staff (SRE) at Salesforce. She’s also a co-organizer of the SF Big Analytics meetup group and is always looking for ways to make platforms more scalable, cost efficient, and secure. Previously, Anya worked at Alpine Data, where she focused on Spark operations.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.

Albert Bifet is a professor at LTCI, Telecom ParisTech; head of the Data, Intelligence and Graphs (DIG) group at Telecom ParisTech; and a scientific collaborator at Ecole Polytechnique. A big data scientist with 10+ years of international research experience, Albert has led new open source software projects for business analytics, data mining, and machine learning at Huawei, Yahoo, the University of Waikato, and UPC. At Yahoo Labs, he cofounded Apache SAMOA (Scalable Advanced Massive Online Analysis), a distributed streaming machine learning framework that contains a programming abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning Group, he co-led MOA (Massive Online Analysis), the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and the editor of the “Big Data Mining” special issue of SIGKDD Explorations in 2012. He was cochair of the industrial track at ECML PKDD 2015, of BigMine (2012–2014), and of the data streams track at ACM SAC (2012–2015). He holds a PhD from BarcelonaTech.

Presentations

Machine learning for non-stationary streaming data using Structured Streaming and StreamDM Session

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. This talk covers how StreamDM can be used alongside Structured Streaming to build incremental models, especially for non-stationary streams (i.e., those with concept drift). Concretely, we cover how to develop, apply, and evaluate learning models using StreamDM and Structured Streaming.
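As a toy illustration of drift handling (not StreamDM's actual API; the class, window size, and threshold below are invented), one simple scheme compares a model's recent error rate against a reference window and signals drift when the gap grows too large:

```python
from collections import deque

class WindowDriftDetector:
    """Signal concept drift when the error rate in a recent window exceeds
    the error rate of an initial reference window by `threshold`."""

    def __init__(self, window=50, threshold=0.2):
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def add(self, error):
        """Feed a 0/1 error indicator; return True when drift is detected."""
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(error)  # still filling the reference window
            return False
        self.recent.append(error)
        if len(self.recent) < self.recent.maxlen:
            return False
        ref_rate = sum(self.reference) / len(self.reference)
        new_rate = sum(self.recent) / len(self.recent)
        return new_rate - ref_rate > self.threshold

detector = WindowDriftDetector(window=20, threshold=0.2)
# Stable phase: 5% errors; then the concept shifts and errors jump to 50%.
stream = [0] * 19 + [1] + [1 if i % 2 == 0 else 0 for i in range(40)]
drift_at = next((i for i, e in enumerate(stream) if detector.add(e)), None)
```

On detecting drift, an incremental learner would typically reset or retrain its model on recent data; the production detectors in stream mining libraries use statistically grounded tests rather than this fixed threshold.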

Ryan Blue is an engineer on Netflix’s Big Data Platform team. Before Netflix, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is also the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.

Presentations

Introducing Iceberg: Tables Designed for Object Stores Session

Iceberg is a new open source project that defines a table layout with properties specifically designed for cloud object stores such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.

The evolution of Netflix's S3 data warehouse Session

In the last few years, Netflix's data warehouse has grown to more than 100PB in S3. This talk will summarize what we've learned, the tools we currently use and those we've retired, as well as the improvements we are rolling out, including Iceberg, a new table format for S3.

Matt leads the machine learning product team at Cloudera, guiding the platform experience for data scientists and data engineers, including products like Cloudera Data Science Workbench. Before that, he led Cloudera’s product marketing team for three years, with roles spanning product, solution, and partner marketing. Prior to Cloudera, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in Computer Science and Mathematics from the University of Massachusetts Amherst.

Presentations

A roadmap for open data science Session

An overview of considerations and tradeoffs in choosing an open approach to enterprise data science. In this talk we’ll share a model to help organizations begin the journey, build momentum, and reduce reliance on legacy software. The model covers executive leadership, cost transparency, and clear metrics of user adoption and success with open data science tools.

Claudiu Branzan is the VP of data science and engineering at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural Language Understanding at Scale with Spark NLP Tutorial

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable, open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Mikio Braun is principal engineer for search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in the industry.

Presentations

Executive Briefing: from Business to AI - missing pieces in becoming "AI ready" Session

To become "AI ready," an organization must not only provide the right technical infrastructure for data collection and processing but also learn new skills. In this talk, I will highlight three such missing pieces: making the connection between business problems and AI technology, AI-driven development, and how to run AI-based projects.

Machine learning for time series: What works and what doesn't Session

Time series data has many applications in industry, from analyzing server metrics to monitoring IoT signals and outlier detection. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn’t, and industry use cases.

Lindsay is a motivated, curious, and analytical data scientist with more than a decade of experience with research methods and the scientific process. From generating testable hypotheses, through wrangling imperfect data, to finding insights via analytical models, she excels at asking incisive questions and using data to tell compelling stories.
Lindsay is passionate about teaching the skills necessary to analyze data more efficiently and effectively. Through this work, she has developed and taught workshops and online courses at the University of New Brunswick, and is a Data Carpentry instructor and Ladies Learning Code chapter co-lead. Having recently made a career pivot from biogeochemistry to data science, she is also well-positioned to provide insight into the applicability of academic research and analysis skills to business problems.

Presentations

From Theory to Data Product - Applying Data Science Methods to Effect Business Change Tutorial

This tutorial is a primer on crafting well-conceived data science projects on course toward uncovering valuable business insights. Using case studies and hands-on skills development, we will teach techniques that are essential for a variety of audiences invested in effecting real business change.

Founder/CEO, Blue Badge Insights; ZDNet Big Data blogger; Gigaom analyst; Microsoft tech influencer.

Presentations

Data Governance: A Big Job That's Getting Bigger Session

Data governance is a product category that has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. This session tracks data governance's past, present, and future.

Issac works at LinkedIn on the data management team, which is in charge of ingestion, lifecycle, and compliance of most HDFS data, as well as providing tools for the big data ecosystem at LinkedIn. He is a core developer and committer for Apache Gobblin, a distributed big data integration framework for batch and streaming systems. His previous work focused on analytics for video streaming.

Presentations

Enforcing GDPR Compliance at Scale Session

With over 100 million LinkedIn members in the EU, enforcing GDPR compliance is challenging. In this talk, we explain the architecture of our system and how we leverage Hive, Kafka, Gobblin, and WhereHows to ensure compliance.

Andrew Burt is chief privacy officer and legal engineer at Immuta, the data management platform for the world’s most secure organizations. He is also a visiting fellow at Yale Law School’s Information Society Project. Previously, Andrew was a special advisor for policy to the head of the FBI Cyber Division, where he served as lead author on the FBI’s after-action report on the 2014 attack on Sony. A leading authority on the intersection of machine learning, regulation, and law, Andrew has published articles on technology, history, and law in the New York Times, the Financial Times, Slate, and the Yale Journal of International Affairs. His book, American Hysteria: The Untold Story of Mass Political Extremism in the United States, was called “a must-read book dealing with a topic few want to tackle” by Nobel laureate Archbishop Emeritus Desmond Tutu. Andrew holds a JD from Yale Law School and a BA from McGill University. He is a term member of the Council on Foreign Relations, a member of the Washington, DC, and Virginia State Bars, and a Global Information Assurance Certified (GIAC) cyber incident response handler.

Presentations

Beyond Explainability: Regulating Machine Learning In Practice Session

Machine learning is becoming prevalent across industries, creating new types of risk. Managing this risk is quickly becoming a central challenge for major organizations, one that strains data science teams, legal personnel, and the C-suite alike. This talk will highlight lessons from past regulations focused on similar technology and conclude with a proposal for new ways to manage risk in ML.

Michelle Casbon is a senior engineer on the Google Cloud Platform developer relations team, where she focuses on open source contributions and community engagement for machine learning and big data tools. Michelle’s development experience spans more than a decade and has primarily focused on multilingual natural language processing, system architecture and integration, and continuous delivery pipelines for machine learning applications. Previously, she was a senior engineer and director of data science at several San Francisco-based startups, building and shipping machine learning products on distributed platforms using both AWS and GCP. She especially loves working with open source projects and is a contributor to Kubeflow. Michelle holds a master’s degree from the University of Cambridge.

Presentations

Kubeflow explained: Portable Machine Learning on Kubernetes Session

Learn how to build a machine learning application with Kubeflow, which makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere. Kubeflow supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. A project contributor will present what Kubeflow currently supports and the long-term vision for the project.

Amber Case is the director of Esri’s R&D Center in Portland, where she works on open source developer tools and next-generation location-based technology. Previously, Amber was the CEO and cofounder of Geoloqi, a location-based software company acquired by Esri in 2012. She is an advocate of privacy, data ownership, and calm technology. You can follow Amber on Twitter or at Caseorganic.com.

Presentations

Keynote with Amber Case Keynote

Amber Case, Director of R&D, Esri

Sarah Catanzaro is a principal at Amplify Partners, where she focuses on investing in high-potential startups that leverage machine intelligence and high-performance computing to solve real-world problems. She joins Amplify from Canvas Ventures, where she co-led investments in Kinetica, Platform9, and Fluxx. Sarah has several years of experience developing data acquisition strategies and leading machine and deep learning-enabled product development at organizations of various sizes. Most recently, as head of data at Mattermark, she led a team to collect and organize information on over one million private companies. Previously, she implemented analytics solutions for municipal and federal agencies as a consultant at Palantir and as an analyst at Cyveillance. She also directed projects on adversary behavioral modeling and Somali pirate network analysis as a program manager at the Center for Advanced Defense Studies. Sarah holds a BA in international security studies from Stanford University.

Presentations

VC trends in machine learning and data science Session

In this panel, venture capital investors will discuss how startups can accelerate enterprise adoption of machine learning and what new tech trends will give rise to the next transformation in the Big Data landscape.

Mark is a hacker at H2O. He was previously in the finance world as a quantitative research developer at Thomson Reuters and Nipun Capital. He also worked as a data scientist at an IoT startup, where he built a web-based machine learning platform and developed predictive models.

Mark has an MS in financial engineering from UCLA and a BS in computer engineering from the University of Illinois Urbana-Champaign. In his spare time, Mark likes competing on Kaggle and cycling.

Presentations

Practical Techniques for Interpreting Machine Learning Models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.

Vinoth Chandar works on data infrastructure at Uber, with a focus on Hadoop and Spark. Vinoth has keen interest in unified architectures for data analytics and processing. Previously, Vinoth was the LinkedIn lead on Voldemort and worked on Oracle server’s replication engine, HPC, and stream processing.

Presentations

Hudi : Unifying storage & serving for batch & near real-time analytics Session

Uber needs to provide faster, fresher data to data consumers and products that run hundreds of thousands of analytical queries every day. Uber engineers share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Manna Chang is a senior data scientist at Optum Enterprise Analytics, where she plays a leading role in developing and providing innovative technologies and methods to meet customer needs and answer healthcare-related challenges. She holds a PhD in biochemistry and an MS in statistics. Her past experience applying machine learning techniques in drug discovery and genomic outcome studies led to her current role in data science. In her spare time, she loves sci-fi movies and enjoys hiking.

Presentations

Breaking the rules: End Stage Renal Disease Prediction Session

This presentation will show how supervised and unsupervised learning methods can work with claims data and complement each other. A supervised method will look at CKD patients at risk of developing ESRD, and an unsupervised approach will classify patients who tend to develop the disease faster than others.

Danny is currently a software engineer at Uber on the Hadoop platform team, working on large-scale data ingestion and dispersal pipelines and libraries that leverage Apache Spark. Previously, he was the tech lead at Uber Maps, building data pipelines to produce metrics that help analyze the quality of mapping data. Before joining Uber, Danny was at Twitter, where he was an original member of the core team building Manhattan, a key-value store powering Twitter’s use cases.

Danny has a BS in computer science from UCLA and an MS in computer science from USC.

Presentations

Marmaray – A generic, scalable, and pluggable Hadoop data ingestion & dispersal framework Session

Marmaray is a generic Hadoop ingestion and dispersal framework recently released to production at Uber. We will introduce Marmaray’s main features and the business needs it meets, share how Marmaray can help a team ensure its data is reliably ingested into Hive or dispersed into online data stores, and give a deep dive into the architecture to show how it all works.

Felix, an Apache Spark PMC member and committer, started his journey in the big data space about five years ago with the then state-of-the-art MapReduce. Since then, he has (re)built Hadoop clusters from bare metal more times than he would like, created a Hadoop “distro” from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark over 3.5 years and has contributed to the project for more than three years. In addition to building things, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Presentations

Your 5 billion rides are arriving now: scaling Apache Spark for data pipelines and intelligent systems at Uber Session

Do you know how your Uber rides are powered by Apache Spark? Come to this talk to learn how Uber builds its data platform with Apache Spark at enormous scale and what unique challenges we face and overcome.

Kapil is a senior product manager at Amazon Web Services, focusing on real-time machine learning on high-volume, high-velocity data. He also runs the streaming data ingestion business at AWS, Kinesis Data Firehose. Previously, at Akamai Technologies, he led the analytics business and launched and scaled multiple new products, including real-time video monitoring services (Media Analytics and QoS Monitor) and the award-winning broadcast operations as a service (BOCC).

Presentations

Continuous machine learning over streaming data, the story continues. Session

Understand how unsupervised learning can provide insights into streaming data, with new applications that impute missing values, forecast future values, detect hotspots, and perform classification tasks, and learn how to implement these techniques efficiently in real time over massive data streams.

Anant Chintamaneni is VP of products at BlueData. Anant has more than 15 years of experience in business intelligence, advanced analytics, and big data infrastructure. He is currently responsible for product management at BlueData, where he focuses on helping enterprises deploy big data technologies, including Hadoop and Spark. Prior to BlueData, Anant led the product management team for Pivotal’s big data suite.

Presentations

What's the Hadoop-la about Kubernetes? Session

There is increased interest in using Kubernetes (K8s), the open source container orchestration system, for modern big data workloads. The promised land is a unified platform for cloud-native stateless and stateful data services. However, orchestrating stateful, multi-service big data clusters brings unique challenges. This session will delve into the considerations for running big data services on K8s.

Erin joined Airbnb in 2011 and is the company’s most tenured data scientist. She has led data science and analytics initiatives across the company, including work with customer experience, legal, communications, and public policy. Currently, she is the data scientist for Airbnb’s Human team, which has a mission to house people in need, including evacuees of disasters and refugees. In 2016, she cofounded Data University, a company-wide data training program in which over a quarter of the company has participated. Prior to Airbnb, she worked in education consulting and program management in Washington, DC. Erin received a PhD in economics from Georgia State University and a BA in mathematics education and economics from Anderson University (IN). Erin is a proud Airbnb Superhost, having welcomed nearly 1,000 guests since 2011. In her spare time she enjoys traveling, reading, pub trivia, and golfing.

Presentations

Data University: How Airbnb Democratized Data Session

Airbnb has open-sourced many high-leverage data tools: Airflow, Superset, and the Knowledge Repo. However, adoption of these tools across Airbnb was relatively low. To make data more accessible and utilized in decision-making, Airbnb launched Data University in early 2017. Since the launch, over a quarter of the company has participated in the program, and data tool utilization rates have doubled.

Mark has been building web applications since creating his first image map for his band’s page in 1995, and he was working with computers long before that. He currently works at Viacom with talented engineers on data and machine learning. He also helped initiate Viacom’s open source program.

Presentations

Agility to Data Product Development: Plug and Play Data Architecture Session

Data products, distinct from data-driven products, are finding their own place in organizational data-driven decision making. Shifting the focus to “data” opens up new opportunities. Drawing on case studies, this presentation dives into a layered implementation architecture and offers intuitive lessons and solutions that allow for more agile, reusable data modules for a data product team.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand Your Data Science and Machine Learning Skills (Python, R, SQL, Spark, TensorFlow) 1-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, with different syntaxes, conventions, and terminology. The instructor will simplify the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, participants will overcome obstacles to getting started using new tools.

Lawrence is a partner and advanced analytics practice leader at Cicero Group. Lawrence has spent the last decade building Cicero’s analytics practice, helping Fortune 500 firms solve real business challenges with data, including attrition, segmentation, sales prioritization, pricing, and customer satisfaction. He also leads the firm in predictive analytics and big data-related engagements, applying Cicero’s deep expertise in strategy execution to ensure data delivers ROI. He has partnered with companies to help them shift from reactive to predictive analytics by collecting and analyzing real-time information and distributing it across the organization, allowing management to make better, faster decisions that move the business forward.

Lawrence is a frequent speaker and thought leader in the advanced analytics space, speaking at events such as Predictive Analytics World for Business and Workforce and the Global Big Data Conference, as well as serving as chairperson for the Data Analytics Leaders Event, where data chiefs and heads of BI and analytics functions come together to explore accelerating the path of data to value. His views and recommendations on big data and advanced analytics have been published in CIO Review and Predictive Analytics Times.

Lawrence holds an MS in predictive analytics from Northwestern University, an MBA with an emphasis in business economics from Westminster College, and a BA from Brigham Young University.

Presentations

Realizing the true value in your data: Data-drivenness Assessment Session

We've worked with many firms and seen over and over that they struggle to leverage their data. We've developed a methodology for assessing four critical areas that firms must consider when looking to make the analytical leap: data strategy; data culture; data analysis and implementation; and data management and architecture.

Dan Crankshaw is a PhD student in the CS Department at UC Berkeley, where he works in the RISELab. After cutting his teeth doing large-scale data analysis on cosmology simulation data and building systems for distributed graph analysis, Dan has turned his attention to machine learning systems. His current research interests include systems and techniques for serving and deploying machine learning, with a particular emphasis on low-latency and interactive applications.

Presentations

Model serving and management at scale using open-source tools Tutorial

This tutorial consists of three parts. First, I will present an overview of the current challenges in deploying machine learning applications into production and provide a survey of the current state of prediction-serving infrastructure. Next, I will provide a deep dive on the Clipper serving system. Finally, I will run a hands-on workshop on getting started with Clipper.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Data Case Studies welcome Tutorial

Welcome to the Data Case Studies tutorial.

Findata welcome Tutorial

Program Chair, Alistair Croll, welcomes you to Findata Day.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday Keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Umur is co-founder and CEO of Citus Data, a leading Postgres company whose mission is to make it so companies never have to worry about scaling their relational database again.

Umur has over 15 years of experience driving complex enterprise software, IT, and database initiatives at large enterprises and at different startups—and he earned a master’s in Management Science & Engineering from Stanford University.

As CEO of Citus Data, Umur wears both operational and strategic hats. Umur works directly with technical founders at SaaS companies to help them scale their multi-tenant applications, and with enterprise architects to power real-time analytics apps that need to handle large-scale data.

Umur’s team at Citus Data is active in the Postgres community, sharing expertise and contributing key components and extensions. His company open sourced its distributed database extension for PostgreSQL in early 2016.

Umur has a deep interest in how scalable systems of record and systems of engagement can help businesses grow—and is excited about the past, present, and future state of Postgres.

Presentations

The state of Postgres Session

PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases.

Paul Curtis is a principal solutions engineer at MapR, where he provides pre- and postsales technical support to MapR’s worldwide systems engineering team. Previously, Paul served as senior operations engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks, and advertisers; systems manager for Spiral Universe, a company providing school administration software as a service; senior support engineer positions at Sun Microsystems; enterprise account technical management positions for both Netscape and FileNet; and roles in application development at Applix, IBM Service Bureau, and Ticketron. Paul got started in the ancient personal computing days; he began his first full-time programming job on the day the IBM PC was introduced.

Presentations

Clouds and Containers: Case Studies for Big Data Session

Now that the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure that provides business insights? This discussion explores three customer deployments that leverage the best of private clouds and containers to provide a flexible big data environment.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday Keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Olga Cuznetova is a data science team lead at Optum Enterprise Analytics, where she guides junior team members on their projects and helps implement data science solutions that address healthcare business needs. Currently, her projects focus mostly on building disease progression and clinical operations models; examples include predicting high-cost diabetic patients, predicting progression to end-stage renal disease, implementing a substance abuse disorder model using external clients’ data, and predicting medical prior-authorization outcomes. Prior to joining the Optum Enterprise Analytics team, Olga completed a one-year Technology Development Program focused on developing essential technical skills, healthcare business acumen, and an analytical skill set, which led her to choose a data science career path. Olga holds a BS in finance from Central Connecticut State University. When Olga has a spare moment, you can find her traveling both in the United States and abroad.

Presentations

Breaking the rules: End Stage Renal Disease Prediction Session

This presentation will show how supervised and unsupervised learning methods can work with claims data and complement each other. A supervised method will look at CKD patients at risk of developing ESRD, and an unsupervised approach will classify patients who tend to develop the disease faster than others.

Michelangelo D’Agostino is the senior director of data science at ShopRunner, where he leads a team that develops statistical models and writes software that leverages their unique cross-retailer e-commerce dataset. Michelangelo came to ShopRunner from Civis Analytics, a Chicago-based data science software and consulting company that spun out of the 2012 Obama re-election campaign. At Civis, he led the data science R&D team. Prior to that, he was a senior analyst in digital analytics with the 2012 Obama re-election campaign, where he helped to optimize the campaign’s email fundraising juggernaut and analyzed social media data. Michelangelo has been a mentor with the Data Science for Social Good Fellowship. He holds a PhD in particle astrophysics from UC Berkeley and got his start in analytics sifting through neutrino data from the IceCube experiment. Accordingly, he spent two glorious months at the South Pole, where he slept in a tent salvaged from the Korean War and enjoyed the twice-weekly shower rationing. He’s also written about science and technology for the Economist.

Presentations

The Care and Feeding of Data Scientists: Concrete Tips for Retaining Your Data Science Team Session

Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling and infrastructure to continuing education, this talk will offer concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value.

Zavain is driven by smart software that leverages data and machine intelligence to scale, augment, and balance human intelligence. He invests in companies that use machine learning and AI to augment or replace physical-world functions, including biology, language, manufacturing, and analysis. He looks for entrepreneurs who can use software and data to hone a philosophical position on where the world is and how to direct it for the better.

Zavain has led Lux’s investments in Primer, a machine intelligence startup; Clarifai, which democratizes cutting-edge deep neural networks; Capella, which is developing novel medicines based on computational insight applied to genomic data; Recursion, which uses automation and deep learning to develop drugs for rare diseases; Tempo Automation, which applies software and automation to electronics manufacturing; Rigetti Computing, which is fabricating some of the fastest quantum chips in the world; Visor, which aims to simplify tax preparation; and Blockstack, which builds architectures to decentralize current winner-take-all centralized web components.

Prior to VC, Zavain was a founder and computer scientist. At Discovery Engine (acquired by Twitter), he engineered machine learning and AI systems on a proprietary distributed computing framework to build web-scale ranking algorithms. Zavain was also a cofounder of Fountainhop, one of the first hyperlocal social networks. Zavain holds a BS in symbolic systems and an MS in computer science from Stanford, where he was a researcher in the AI Lab. He is currently a lecturer at Stanford and has taught quarter-long seminars on cryptocurrencies, artificial intelligence and philosophy, and venture capital.

Presentations

VC trends in machine learning and data science Session

In this panel, venture capital investors will discuss how startups can accelerate enterprise adoption of machine learning and what new tech trends will give rise to the next transformation in the Big Data landscape.

Milene Darnis is a data product manager at Uber, focusing on building a world-class experimentation platform. Through her role as a product manager and her previous experience at Uber as a data engineer, she has developed a passion for linking data to concrete business problems.
Previously, Milene was a business intelligence engineer at a mobile gaming company. She holds a master’s degree in engineering from Telecom ParisTech in France.

Presentations

A/B testing at Uber: how we built a BYOM (Bring Your Own Metrics) platform Session

Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis, who leads product management for the experimentation platform, will talk about how the team built a scalable, self-serve platform that lets users plug in any metric to analyze.

Kaushik is a partner and CTO at Novantas, where he is responsible for the technology strategy and R&D roadmap of a number of cloud-based platforms. He has 15+ years of experience leading large engineering teams to develop scalable, high-performance analytics platforms. He holds an MS in engineering from the University of Pennsylvania, an MS in computer science from the University of Missouri, and an MS in computational finance from Carnegie Mellon University.

Presentations

Case Study: A Spark-based Distributed Simulation Optimization Architecture for Portfolio Optimization in Retail Banking Session

We discuss a large-scale optimization architecture in Spark for a consumer product portfolio optimization case study in retail banking. The architecture combines a simulator, which distributes computation of complex real-world scenarios given varying macroeconomic factors, consumer behavior, and the competitive landscape, with a constraint optimizer that uses business rules as constraints to meet growth targets.

Ifi Derekli is a Systems Engineer at Cloudera, focusing on helping enterprises solve big data problems using Hadoop technologies. Prior to Cloudera, Ifi was a Presales Technical Consultant at Hewlett-Packard Enterprise where she provided technical expertise for Vertica and IDOL (currently part of Micro Focus). She holds a B.S. in Electrical Engineering and Computer Science from Yale University.

Presentations

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Ding Ding is a software engineer on Intel’s big data technology team, where she works on developing and optimizing distributed machine learning and deep learning algorithms on Apache Spark, focusing particularly on large-scale analytical applications and infrastructure on Spark.

Presentations

A Deep Learning Approach for Precipitation Nowcasting with RNN using BigDL on Spark Session

Precipitation nowcasting is used to predict the future rainfall intensity over a relatively short timeframe. The forecasting resolution and time accuracy required are much higher than other traditional forecasting tasks. We will talk about building a precipitation nowcasting system with recurrent neural networks using BigDL on Apache Spark.

Harish Doddi is cofounder and CEO of Datatron Technologies. Previously, he held roles at Oracle; Twitter, where he worked on open source technologies, including Apache Cassandra and Apache Hadoop, and built Blobstore, Twitter’s photo storage platform; Snapchat, where he worked on the backend for Snapchat Stories; and Lyft, where he worked on the surge pricing model. Harish holds a master’s degree in computer science from Stanford, where he focused on systems and databases, and an undergraduate degree in computer science from the International Institute of Information Technology in Hyderabad.

Presentations

Infrastructure for deploying machine learning to production: lessons and best practices in large financial institutions Session

Large financial institutions have many data science teams (e.g., fraud, credit risk, marketing), often using a diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. This talk will cover the challenges and lessons learned from deploying AI models to production in large financial institutions.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive Briefing: Getting Your Data Ready for Heavy EU Privacy Regulations (GDPR) Session

The General Data Protection Regulation (GDPR) goes into effect in May 2018 for firms doing any business in the EU. However, many companies aren't prepared for the strict regulation or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify compliance with GDPR, as well as with future regulations.

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Florian Douetteau is the CEO of Dataiku, a company democratizing access to data science.
A programmer since early childhood, he dropped the prestigious École Normale math courses to start working at 20 at a startup that later became Exalead, a search engine company from the early days of the French startup community. His interests include data, artificial intelligence, and how tech can improve the daily work life of tech people.

Presentations

Executive Briefing: Profit from AI and Machine Learning – The best practices for people & process Session

Ovum will present the results of research cosponsored by Dataiku, surveying a specially selected sample of chief data officers and data scientists, on how to map roles and processes to make success with AI in the business repeatable.

James Dreiss is a Senior Data Scientist at Reuters. He studied at New York University and the London School of Economics, and previously worked at the Metropolitan Museum of Art in New York.

Presentations

Document Vectors in the Wild: Building a Content Recommendation System for Reuters.com Session

A discussion of the challenges in building a content recommendation system for one of the largest news sites in the world, Reuters.com. The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content.

Chiny Driscoll is MetiStream’s founder and CEO. MetiStream is a provider of real-time integration and analytic services in the Big Data arena.

Chiny has more than 24 years of management and executive leadership experience in the technology industry, having served in a variety of roles with Fortune 500 tech companies.

Prior to founding MetiStream, Chiny was the Worldwide Executive Leader of Big Data Services for IBM’s Information Management division. There, she led all of the professional services which implemented and supported IBM’s Big Data products and solutions across industries such as financial services, communications, public sector and retail. Key solutions included streaming, analytics, Hadoop, and DW appliance related solutions.

Before IBM, Chiny was the VP and General Manager of Netezza, a leader in Big Data warehouse appliances and advanced analytics which was acquired by IBM in 2010. Preceding her work building Netezza’s services and education organization, Chiny held various global and regional leadership roles at TIBCO Software. Chiny’s last position at TIBCO was running the pre-sales, services and sales operations for the Public Sector division. Prior to TIBCO she served in services leadership roles at EDS and other services and technology companies.

Presentations

Digging for Gold: Developing AI in healthcare against unstructured text data - exploring the opportunities and challenges Session

This Cloudera/MetiStream solution lets healthcare providers automate the extraction, processing, and analysis of clinical notes within the electronic health record, in batch or real time. By leveraging NLP capabilities to conduct fast analytics in a distributed environment, providers can improve care, identify errors, and recognize efficiencies in billing and diagnoses. Includes a use case from Rush University Medical Center.

Carolyn Duby, a Hortonworks Solutions Engineer, is dedicated to helping her customers harness the power of their data with Apache open source platforms. A subject matter expert in cybersecurity and data science, Carolyn is an active leader in the community and a frequent speaker at Future of Data meetups in Boston, MA, and Providence, RI, and at conferences such as the Open Data Science Conference and the Global Data Science Conference. Prior to joining Hortonworks, she was the architect for cybersecurity event correlation at SecureWorks. She earned an ScB, magna cum laude, and an ScM in computer science from Brown University. A lifelong learner, she recently completed the Johns Hopkins University Data Science Specialization on Coursera.

Presentations

Apache Metron: Open Source Cyber Security at Scale Tutorial

Learn how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable, open-source platform. After this interactive overview of the platform's major features, you will be ready to analyze your own haystack back at the office.

Ted Dunning is chief application architect at MapR Technologies. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects, and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He has contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Progress for Big Data in Kubernetes Session

Stateful containers are a well-known antipattern, but the standard answer of managing state in a separate storage tier is costly and complex. Recent developments have changed things dramatically for the better. In particular, you can now manage a high-performance, software-defined storage tier entirely in Kubernetes. I will describe what's new and how it makes big data easier on Kubernetes.

Brent is the Director of Data Strategy at Domo and has over 14 years of enterprise analytics experience at Omniture, Adobe, and Domo. He is a regular Forbes contributor on data-related topics and has published two books on digital analytics, including Web Analytics Action Hero. In 2016, Brent received the Most Influential Industry Contributor Award from the Digital Analytics Association (DAA). He has been a popular presenter at multiple conferences such as Shop.org, Adtech, Pubcon, and Adobe Summit. Brent earned his MBA from Brigham Young University and his BBA (Marketing) degree from Simon Fraser University. Follow him on Twitter @analyticshero.

Presentations

Stories Beat Statistics: How to Master the Art and Science of Data Storytelling Session

With companies collecting all kinds of data and using advanced tools and techniques to find insights, they often fail at the last mile: communicating insights effectively to drive change. This session will look at the power that stories wield over statistics and explore the art and science of data storytelling, an essential skill that everyone must have in today’s data economy.

Barbara Eckman is a Principal Data Architect at Comcast. She leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing Big Data. Barbara is a recognized technical innovator in Big Data architecture and governance, as well as scientific data and model integration. Her experience includes technical leadership positions at a Human Genome Project Center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.

Presentations

Data Discovery and Lineage: Integrating streaming data in the public cloud with on-prem, classic datastores and heterogeneous schema types Session

Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. We were recently challenged to integrate on-prem data sources, including traditional data warehouses and RDBMSs. Our data governance strategy must now include relational and JSON schemas in addition to Apache Avro. Here’s how we did it!

Laura Eisenhardt is EVP at iKnow Solutions Europe and the founder of DigitalConscience.org, a CSR platform designed to create opportunities for technical resources (specifically expats) to give back to communities with their unique skills while making a huge impact locally. Laura has led massive programs for the World Health Organization across Africa, collecting big data in over 165 languages, and specializes in data quality and consistency. Laura is also COO for the American Institute of Minimally Invasive Heart Surgery (AIMHS.org), a nonprofit designed to educate the public and heart surgeons worldwide on how to perform open heart surgery without splitting open the chest. Why? People who have complex heart surgery via a minimally invasive procedure return to work in two weeks versus 9–12 months, which has a substantial impact on society, family finances, depression, and cost for all.

Presentations

GDPR and the Australian Privacy Act – Forcing the Legal and Ethical Hands of How Companies Collect, Use and Analyze Data Findata

Data brings unprecedented insights to industries about customer behavior, and personal data is being harvested at scale. We know more about our customers and neighbors than at any other time in history, but we need to avoid "crossing the creepy line." Governance and security experts from Cloudera, Mastercard, and iKnow Solutions discuss how ethical behavior drives trust, especially in today's IoT age.

Jacob Eisinger joined Talroo in 2015. As the Director of Data, Jacob is responsible for the Special Projects initiative to pilot and validate high-impact business models and technologies. Previously at Talroo, Jacob led search, personalization, data warehousing, bot detection, and machine learning. Before that, Jacob worked in the Emerging Technologies group at IBM, where he worked with technologies like Bluemix, Apache Spark, Apache Kafka, OAuth, and web service standards. Jacob is also an accomplished inventor with over 20 patent applications. He holds a bachelor’s degree in computer science from Virginia Tech.

Presentations

Job recommendations leveraging Deep Learning on Apache Spark with BigDL Session

Can the talent industry make job search and matching more relevant and personalized for candidates by leveraging deep learning techniques? In this session, we will demonstrate how to use BigDL on Apache Spark (a distributed deep learning framework for Apache Spark) to predict a candidate’s probability of applying to specific jobs based on their resume.

Jonathan is CTO and co-founder at DataStax and the founding project chair of Apache Cassandra. Previously, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy.

Presentations

Cassandra vs Cloud Databases Session

Is open-source Apache Cassandra still relevant in an era of hosted cloud databases? DataStax CTO Jonathan Ellis will discuss Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.

Nick Elprin is the cofounder and CEO of Domino Data Lab, a data science platform that accelerates the development and deployment of models while enabling best practices like collaboration and reproducibility. Previously, Nick built tools for quantitative researchers at Bridgewater, one of the world’s largest hedge funds. He has over a decade of experience working with data scientists at advanced enterprises. Nick holds a BA and MS in computer science from Harvard.

Presentations

Managing Data Science in the Enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Moty Fania is a principal engineer for big data analytics at Intel IT and the CTO of the Advanced Analytics Group, which delivers big data and AI solutions across Intel. With over 15 years of experience in analytics, data warehousing, and decision support solutions, Moty leads the development and architecture of various big data and AI initiatives, such as IoT systems, predictive engines, online inference systems, and more. Moty holds a bachelor’s degree in economics and computer science and a master’s degree in business administration from Ben-Gurion University.

Presentations

A high-performance system for deep learning inference and visual inspection Session

In this session, Moty Fania will share Intel IT’s experience implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time streaming and online actuation. This session highlights the key learnings from this work, with a thorough review of the platform’s architecture.

Usama is founder/CEO of Open Insights, where he has worked with large and small enterprises on AI, big data strategy, and launching new business models, most recently serving as interim CTO for Stella.AI, a VC-funded startup in AI for HR/recruiting, and helping develop new revenue streams in mobile payments/MFS at MTN, Africa’s largest mobile operator. Usama was the first Global Chief Data Officer at Barclays in London (2013–2014) after he launched the largest tech startup accelerator in MENA (2010–2013) as Executive Chairman of Oasis500 in Jordan. His background includes chairman/CEO roles at several startups, including Blue Kangaroo Corp, DMX Group, and digiMine (Audience Science). He was the first person ever to hold the Chief Data Officer title, when Yahoo! acquired his second startup in 2004. He held leadership roles at Microsoft (1996–2000) and founded the machine learning systems group at NASA’s Jet Propulsion Laboratory (1989–1995), where his work on machine learning earned the top Excellence in Research award from Caltech and a US government medal from NASA. Usama has published over 100 technical articles on data mining, data science, AI/ML, and databases. He holds over 30 patents and is a Fellow of the Association for the Advancement of Artificial Intelligence and a Fellow of the Association for Computing Machinery. Usama earned his PhD in engineering (AI/machine learning) from the University of Michigan. He also holds two BSEs in engineering, an MSE in computer engineering, and an MSc in mathematics.

Presentations

Next Generation Cybersecurity via Data Fusion, AI and Big Data: Pragmatic Lessons from the Front Lines in Financial Services Session

This presentation will share the main outcomes and learnings from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on big data and AI at a major EU bank, in collaboration with several financial services institutions. The focus is on the learnings and breakthroughs gleaned from making the systems work.

Dr. William “Bill” Fehlman is a Data Scientist Lead at USAA, where he provides machine learning tools and guidance that create efficiencies and greater effectiveness in contact center operations through actionable insights generated from analytics. Bill came to USAA after working as a Senior Analytics Consultant with Clarity Solution Group. Prior to that, he served as an Automation & Robotics Systems Engineer at NASA Langley Research Center. Before that, he served 23 years in the US Army in numerous leadership and operations research roles, including Assistant Professor and Director of the Differential Calculus Program at the US Military Academy. Bill holds a PhD in applied science with a concentration in machine learning from the College of William & Mary, an MS in applied mathematics from Rensselaer Polytechnic Institute, and a BS in mathematics from SUNY Fredonia.

Presentations

An Intuitive Explanation for Approaching Topic Modeling Session

We will compare topic modeling algorithms used to identify latent topics in large volumes of text data, then present coherence scores illustrating which method shows the highest consistency with human judgments of topic quality. We will then discuss the importance of coherence scores in choosing the topic modeling algorithms that best support different use cases.

Stephanie Fischer has many years of consulting experience in big data, machine learning, and human-centric innovation. As a product owner, she develops services and products based on machine learning and content analytics. She speaks at conferences, writes articles on big data and machine learning, and is the founder of datanizing GmbH.

Presentations

From chaos to insight: Automatically derive value from your user-generated content Data Case Studies

Whether in customer emails, product reviews, company wikis, or support communities, user-generated content (UGC) in the form of unstructured text is everywhere, and it’s growing exponentially. This stimulates many companies’ desire for automated evaluation (a “treasure hunt”). Similar techniques have been used successfully for structured information (data warehouses) for quite a while.

Brian Foo is a senior software engineer at Google Cloud working on applied artificial intelligence, where he builds demos for Google Cloud’s strategic customers as well as open source tutorials to improve public understanding of AI. Brian previously worked at Uber, where he trained machine learning models and built large-scale training and inference pipelines for mapping and sensing/perception applications using Hadoop/Spark. Prior to that, Brian headed the real-time bidding optimization team at Rocket Fuel, where he worked on algorithms that determined the millions of ads shown every second across platforms such as web, mobile, and programmatic TV. Brian received a BS in EECS from Berkeley and a PhD in EE telecommunications from UCLA.

Presentations

From Training to Serving: Deploying Tensorflow Models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join this tutorial to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Janet Forbes is an experienced enterprise, business, and senior systems architect with a deep understanding of data, functional, and technical architecture and a proven ability to define, audit, and improve business processes based on best practices. She has extensive experience leading multi-functional teams through the planning and delivery of complex solutions.
With over 25 years of experience across various roles and organizations, Janet focuses on business and data architecture. As a trusted advisor, she works closely with clients in assessing and shaping their data strategy practices.

Presentations

From Theory to Data Product - Applying Data Science Methods to Effect Business Change Tutorial

This tutorial is a primer on crafting well-conceived data science projects on course toward uncovering valuable business insights. Using case studies and hands-on skills development, we will teach techniques that are essential for a variety of audiences invested in effecting real business change.

Jean-Michel Franco is Director of Product Marketing for Talend’s data governance solutions. He has dedicated his career to developing and broadening the adoption of innovative technologies in companies. Prior to joining Talend, he started out at EDS (now HP), where he created and developed a business intelligence (BI) practice; joined SAP EMEA as Director of Marketing Solutions in France and North Africa; and later served as Innovation Director at Business & Decision. He has authored four books and regularly publishes articles and presents at events and trade shows.

Presentations

Enacting The Data Subjects Access Rights For GDPR With Data Services And Data Management Session

GDPR is more than another regulation to be handled by your back office. Enacting the data subject access rights (DSAR) requires practical action. In this session, we will discuss the practical steps to deploy governed data services.

Bill Franks is Chief Analytics Officer for The International Institute For Analytics (IIA). Franks is also the author of Taming The Big Data Tidal Wave and The Analytics Revolution. His work has spanned clients in a variety of industries for companies ranging in size from Fortune 100 companies to small non-profit organizations. You can learn more at http://www.bill-franks.com.

Presentations

Analytics Maturity: Industry Trends And Financial Impacts Session

The International Institute For Analytics studied the analytics maturity level of large enterprises. The talk will cover how maturity varies by industry and some of the key steps organizations can take to move up the maturity scale. The research also correlates analytics maturity with a wide range of corporate success metrics including financial and reputational measures.

Michael J. Freedman is the cofounder and CTO of TimescaleDB, an open source database that scales SQL for time-series data, and Professor of Computer Science at Princeton University, where his research focuses on distributed systems, networking, and security.

Previously, Michael developed CoralCDN (a decentralized CDN serving millions of daily users) and Ethane (the basis for OpenFlow and software-defined networking) and cofounded Illuminics Systems (acquired by Quova, now part of Neustar). He is a technical advisor to Blockstack.

Michael’s honors include the Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), the SIGCOMM Test of Time Award, the Caspar Bowden Award for Privacy Enhancing Technologies, a Sloan Fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, a DARPA Computer Science Study Group membership, and multiple award publications. He holds a PhD in computer science from NYU’s Courant Institute and bachelor’s and master’s degrees from MIT.

Presentations

Performant time-series data management and analytics with Postgres Session

I describe how to leverage Postgres even for high-volume time-series workloads using TimescaleDB, an open source time-series database built as a Postgres extension. I explain its general architectural design principles, as well as new time-series data management features, including adaptive time partitioning and near-real-time continuous aggregations.

Chris Fregly is founder and research engineer at PipelineAI, a San Francisco-based streaming machine learning and artificial intelligence startup. Previously, Chris was a distributed systems engineer at Netflix, a data solutions engineer at Databricks, and a founding member of the IBM Spark Technology Center in San Francisco. Chris is a regular speaker at conferences and meetups throughout the world. He’s also an Apache Spark contributor, a Netflix Open Source committer, founder of the Global Advanced Spark and TensorFlow meetup, author of the upcoming book Advanced Spark, and creator of the O’Reilly video series Deploying and Scaling Distributed TensorFlow in Production.

Presentations

Building a High Performance Model Serving Engine from Scratch using Kubernetes, GPUs, Docker, Istio, and TensorFlow Session

Applying my Netflix experience to a real-world problem in the ML and AI world, I will demonstrate a full-featured, open-source, end-to-end TensorFlow Model Training and Deployment System using the latest advancements with Kubernetes, TensorFlow, and GPUs.

Brandy Freitas is a research physicist-turned-data scientist based in Boston, MA. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryo-electron microscopy data. She is currently a Principal Data Scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a National Science Foundation Graduate Research Fellow, a James Mills Pierce Fellow, and holds an SM in Biophysics from Harvard University.

Presentations

Executive Briefing: Analytics for Executives - Building an Approachable Language to Drive Data Science in Your Organization Session

Data science is an approachable field given the right framing. Often, though, practitioners and executives are describing opportunities using completely different languages. In this session, Harvard Biophysicist-turned-Data Scientist, Brandy Freitas, will work with participants to develop context and vocabulary around data science topics to help build a culture of data within their organization.

I am a senior global executive and have managed a number of implementation projects for small and large companies over the past decade-plus. I have also initiated and directed a number of AI, OR, and optimization R&D projects. I have developed, transferred, and established best practices and cutting-edge technology in many industries, including retail, distribution, manufacturing, call centers, healthcare, airport services, and security.

- Current CEO of Element AI
- Chief Innovation/Products Officer and Head of JDA Labs, a position I assumed following the successful sale of my company, Planora, in July 2012
- CEO and cofounder of Planora, which brought to market a highly disruptive SaaS solution in the extremely competitive workforce management space by combining a unique blend of AI, machine learning, operations research, and user experience
- Cofounder and Director of Products for Logiweb, a leading custom web development firm focused on web-based analytics and decision support tools, acquired by Innobec

Presentations

From Data Governance to AI Governance: The CIO's new role Session

The CIO is going to need a broader mandate in the company to better align AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company in order to catch biases that can develop from faulty goals or flawed data.

Navdeep is a hacker scientist at H2O.ai. He graduated from California State University, East Bay, with an MS in computational statistics, a BS in statistics, and a BA in psychology (with a minor in mathematics). During his education he developed interests in machine learning, time series analysis, statistical computing, data mining, and data visualization.

Prior to H2O.ai, he worked at a couple of startups and at Cisco Systems, focusing on data science, software development, and marketing research. Before that, he was a consultant at FICO, working with small to midsize banks in the US and South America on risk management across different bank portfolios (car loans, home mortgages, and credit cards). Before stepping into industry, he worked as a researcher/analyst in various neuroscience labs at institutions such as UC Berkeley, UCSF, and the Smith-Kettlewell Eye Research Institute, doing behavioral, electrophysiology, and functional magnetic resonance imaging research.

In his spare time Navdeep enjoys watching documentaries, reading (mostly non-fiction or academic), and working out.

Presentations

Practical Techniques for Interpreting Machine Learning Models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.

Harry founded Periscope Data in 2012 with co-founder Tom O'Neill; the two have grown Periscope Data to serve nearly 1,000 customers. Harry was previously at Google and graduated from the University of Rochester with a bachelor's degree in computer science.

Presentations

An ethical foundation for the AI-driven future Session

What is the moral responsibility of a data team today? As AI and machine learning technologies become part of our everyday life, and as data becomes accessible to everyone, CDOs and data teams are taking on a very important moral role as the conscience of the corporation. This session will highlight the risks companies face if they don't empower data teams to lead the way in ethical data use.

Zachary Glassman is a data scientist in residence at the Data Incubator. Zachary has a passion for building data tools and teaching others to use Python. He studied physics and mathematics as an undergraduate at Pomona College and holds a master’s degree in atomic physics from the University of Maryland.

Presentations

Hands-On Data Science with Python 1-Day Training

The Data Incubator offers a foundation in building intelligent business applications using machine learning. We will walk through all the steps of developing a machine learning pipeline, from prototyping to production: data cleaning, feature engineering, model building and evaluation, and deployment. Students will extend these models into an application using a real-world dataset.

Heitor Murilo Gomes is a researcher at Télécom ParisTech focusing on machine learning—particularly evolving data streams, concept drift, ensemble methods, and big data streams. He co-leads the StreamDM open data stream mining project.

Presentations

Machine learning for non-stationary streaming data using Structured Streaming and StreamDM Session

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. This talk covers how StreamDM can be used alongside Structured Streaming to build incremental models, especially for non-stationary streams (i.e., those with concept drift). Concretely, we will cover how to develop, apply, and evaluate learning models using StreamDM and Structured Streaming.

Bruno Gonçalves is a Moore-Sloan fellow at NYU’s Center for Data Science. With a background in physics and computer science, Bruno has spent his career exploring the use of datasets from sources as diverse as Apache web logs, Wikipedia edits, Twitter posts, epidemiological reports, and census data to analyze and model human behavior and mobility. More recently, he has been focusing on the application of machine learning and neural network techniques to analyze large geolocated datasets.

Presentations

Recurrent Neural Networks for timeseries analysis Tutorial

The world is ever changing, and many of the systems and phenomena we are interested in evolve over time, resulting in time-evolving datasets. Time series often display many interesting properties and levels of correlation. In this tutorial we will introduce students to the use of recurrent neural networks and LSTMs to model and forecast different kinds of time series.
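For background, modeling a time series with an RNN or LSTM typically starts by reframing it as a supervised learning problem with a sliding window; a minimal sketch (the series and window length here are illustrative):

```python
def make_windows(series, n_lags):
    """Turn a univariate series into (input window, next value) pairs,
    the supervised format an RNN or LSTM is trained on."""
    pairs = []
    for i in range(len(series) - n_lags):
        pairs.append((series[i:i + n_lags], series[i + n_lags]))
    return pairs

# Each training example pairs a window of 3 past values with the value that follows.
ts = [1, 2, 3, 4, 5, 6]
windows = make_windows(ts, n_lags=3)
print(windows[0])  # ([1, 2, 3], 4)
```

In practice these windows would then be fed to an LSTM layer in a framework such as Keras or PyTorch.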

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Near-real time Anomaly Detection at Lyft Session

Consumer-facing real-time processing poses a number of challenges, from protecting against fraudulent transactions to managing other risks. The streaming platform at Lyft supports these needs with an architecture that brings together a data science-friendly programming environment and a deployment stack that meets the reliability, scalability, and other SLA requirements of a mission-critical stream processing system.

Sudipto Guha is principal scientist at Amazon Web Services, where he studies the design and implementation of a wide range of computational systems, from resource-constrained devices, such as sensors, to massively parallel and distributed systems. Using an algorithmic framework, Sudipto seeks to design systems that are correct, efficient, and optimized despite their bidirectional asymptotic scale and seeming lack of similarity to human information processes. His recent work focuses on clustering and location theory, statistics and learning theory, database query optimization and mining, approximation algorithms for stochastic control, communication complexity, and data stream algorithms.

Presentations

Continuous machine learning over streaming data: the story continues Session

Learn how unsupervised learning can provide insights into streaming data, with new applications that impute missing values, forecast future values, detect hotspots, and perform classification tasks, and how to implement these methods efficiently so they operate in real time over massive data streams.
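As a minimal illustration of the kind of online computation such systems perform, here is a simple running z-score detector in pure Python; note this is an illustrative sketch, not the tree-based algorithms the session actually covers:

```python
import math

class StreamingZScore:
    """Flags points far from the running mean of a stream.
    Uses Welford's algorithm, so it needs only O(1) memory per stream."""

    def __init__(self, threshold=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold

    def update(self, x):
        # Incrementally update running mean and sum of squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomaly(self, x):
        if self.n < 2:
            return False  # not enough history to judge
        std = math.sqrt(self.m2 / (self.n - 1))
        return std > 0 and abs(x - self.mean) / std > self.threshold

detector = StreamingZScore()
for point in [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]:
    detector.update(point)
print(detector.is_anomaly(50))  # True: far outside the running distribution
print(detector.is_anomaly(10))  # False: consistent with the stream so far
```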

Sumit Gulwani is a Partner Research manager at Microsoft, leading the PROSE research and engineering team that develops APIs for program synthesis (programming by examples and natural language) and incorporates them into real products. He is the inventor of the popular Flash Fill feature in Microsoft Excel used by hundreds of millions of people. He has published 120+ peer-reviewed papers in top-tier conferences/journals across multiple computer science areas, delivered 40+ keynotes/invited talks at various forums, and authored 50+ patent applications (granted and pending). He is a recipient of the prestigious ACM SIGPLAN Robin Milner Young Researcher Award, ACM SIGPLAN Outstanding Doctoral Dissertation Award, and the President’s Gold Medal from IIT Kanpur.

Presentations

Programming by input-output examples Session

Programming by input-output examples (PBE) is a new frontier in AI, set to revolutionize the programming experience for the masses. It can enable end users (99% of whom are non-programmers) to create small scripts, and make data scientists 10-100x more productive for many data wrangling tasks. Come learn about this new programming paradigm: its applications, form factors, the science behind it.
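As a toy illustration of the PBE idea, a synthesizer can search a space of candidate programs for one consistent with every input-output example; the tiny operator set and examples below are invented for illustration, and real engines such as Microsoft PROSE search vastly larger program spaces with sophisticated ranking:

```python
# Toy programming-by-example: find a string program consistent with all examples.
CANDIDATES = {
    "upper": str.upper,
    "lower": str.lower,
    "first_word": lambda s: s.split()[0],
    "last_word": lambda s: s.split()[-1],
    "initials": lambda s: "".join(w[0].upper() for w in s.split()),
}

def synthesize(examples):
    """Return the name of the first candidate program matching every
    (input, output) example, or None if no candidate is consistent."""
    for name, prog in CANDIDATES.items():
        if all(prog(inp) == out for inp, out in examples):
            return name
    return None

# Two examples are enough to disambiguate the intended transformation here.
print(synthesize([("Ada Lovelace", "AL"), ("Alan Turing", "AT")]))  # initials
```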

Patrick Hall is a senior director for data science products at H2O.ai, where he focuses mainly on model interpretability. Patrick is also an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning.

Previously, Patrick held global customer-facing and R&D research roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the eleventh person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Practical Techniques for Interpreting Machine Learning Models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.

Zachary Hanif is a director in Capital One’s Center for Machine Learning, where he leads teams focused on applying machine learning to cybersecurity and financial crime. His research interests revolve around applications of machine learning and graph mining within the realm of massive security data and the automation of model validation and governance. Zachary graduated from the Georgia Institute of Technology.

Presentations

Network Effects: Working with Modern Graph Analytic Systems Session

Modern frameworks and analytical techniques are making graph analysis methods viable for increasingly large complex tasks, while an understanding of graph based analytical techniques can be extremely powerful when applied to modern practical problems. This talk examines three prominent graph analytic methods, including graph convolutional networks, and applies them to concrete use cases.

Dan Harple is the founder and CEO of Context Labs, which is based near MIT and Kendall Square in Cambridge, MA, with offices in Amsterdam and India. Context Labs is a leader in delivering at-scale enterprise blockchain-enabled systems and in advising global market segments and countries on the development of highly efficient ecosystems and interoperable standards to accelerate positive change for stakeholders.
Companies:
A technology entrepreneur for more than 25 years, Dan has founded and built technologies, companies, and products used by billions of Internet users, merging companies with Netscape Communications (establishing key standards for Internet collaboration, streaming, and VoIP), Oracle (providing core big data and collaborative capabilities for Oracle Fusion, now with over 65,000 customers), and a joint venture with China's Sina (providing core underlying technology for its Weibo platform, with over 600 million daily users). Recent work at Context Labs has taken blockchain-enabled platforms from proof-of-concept (POC) stage to at-scale production, with reference deployments in global printing/publishing, global environmental data, and cybersecurity.
Each of Dan's firms successfully raised multiple rounds of Silicon Valley-based venture capital and had liquidity events at various stages. He has been a founder and CEO of technology companies, a senior executive and/or CEO at three NASDAQ-listed tech companies, and an advisor and investor in many others, including serving as acting Chief Innovation Strategy Officer at RR Donnelley.
Influence:
Harple has had a seminal influence on the commercial Internet thanks to his pioneering work with his company, InSoft, in voice over Internet protocol (VoIP), streaming media, and interactive screen sharing/shared whiteboards in the early '90s. He has been behind innovations driving a range of patents that are among the most cited for collaborative computing, VoIP, streaming media, real-time web communications, big data integration, and location-based social media. His influence underlies technologies that power Skype, GoToMeeting, Webex, SmartBoards, Oracle Fusion, Sina Weibo, and YouTube, among others. Recent work centers on big data, blockchain, and supply chain analytics.
Recent efforts have brought blockchain technology into the mainstream and coupled it to sustainable impact-investing methods aimed at building bridges between industrialized sources of climate change and methods to address it. Harple's MIT-developed Pentalytic™ network graph theory technology and method has also enabled global collaboration across fragmented industries, via the architecture and co-founding of the following global initiatives:
• Music Industry: bringing over 300 leading firms (e.g., Warner Music, Universal Music Group, SONY Music Group, Spotify, Soundcloud, IBM, Intel, Sonos, et al) to collaborate to deploy interoperable standards with the Open Music Initiative (open-music.org). (http://open-music.org/blog/2016/9/2/dan-harples-pentalytic-framework-guides-the-omi)
• Automotive Industry: May 2018 saw the launch of MOBI (Mobility Open Blockchain Initiative, dlt.mobi), bringing together major automotive firms and their ecosystems (e.g., BMW, Ford, General Motors, Groupe Renault, Bosch, ZF, Accenture, IBM, Context Labs, et al), with Harple's co-founding of MOBI with the CFO of Toyota. (https://www.newswire.com/news/major-automakers-startups-technology-companies-and-others-launch-20456598)
• Country of The Netherlands: Advisor to the country, shaping the architecture and framing of its national entrepreneurial program, known as StartupDelta (https://www.startupdelta.org)
Development and Influence of At-Scale Standards:
Standards often develop as a result of market penetration and adoption. Dan's efforts in these areas have resulted in the development and adoption of standards that impact billions of users on the Internet. This was the case for the standards subsequently adopted for real-time media streaming (RTSP), which is still deployed and in use for media streaming on the Internet. Dan's early efforts in pioneering voice over IP (VoIP) shaped the H.323 and Real-time Transport Protocol (RTP) standards and influenced the Skype protocol, a derivative of the first peer-to-peer-based media platform from InSoft/Netscape, OpenDVE (Open Digital Video Everywhere). Dan and his team also developed the groundbreaking integration of traditional telephony with the Internet Protocol (IP), connecting the world's largest-selling PBX from AT&T/Lucent to the Internet and allowing callers on traditional phones to place calls to VoIP users on the Internet, and vice versa.
Academic & Research:
Dan's work in scaling technologies and influencing and developing standards has always been in the context of vibrant ecosystems. To further extend and develop a platform that accelerates the leverage of network effects in ecosystems, Dan performed research at MIT, where he developed a new model called Pentalytics™, described in the thesis "Toward a Network Graph-based Cluster Density Index." Concurrently, Dan founded and developed a program at MIT called REAL (Regional Entrepreneurial Acceleration Lab), designed as a living lab to learn about and experiment with ecosystems, which has run every fall at MIT Sloan since 2012.
Informed by the MIT research, Dan founded Context Labs, which uses big data analytics to describe the growth of innovation ecosystems and clusters, intersecting this method with transformative uses of blockchain-based technologies. These early efforts partnered with the MIT Media Lab's City Science work in the Changing Places group. He has collaborated on several Media Lab courses: Beyond Smart Cities (2013) and Changing Cities: How to Prototype New Urban Systems (2014). While at MIT, he concurrently served as an entrepreneur in residence (EIR).
Influence in Development of Market & Technical Ecosystems:
The Pentalytic research resulted in a big data network graph-driven platform called Innovation Scope (iScope™), which provides a data-driven Pentalytic "lens" to assess the resilience and potential of targeted ecosystems. The method and platform have been utilized in the following scenarios:
• Country/Regional/National Innovation Strategies: with the OECD, examining innovation strategies in the Netherlands to provide the foundation for the development of the Dutch national StartupDelta initiative, and as a tool to help the City of Amsterdam develop an innovation strategy. (https://www.startupdelta.org)
• Strategic Expansion Strategies: Utilized by Berklee College of Music in considering strategic locations
• Oil/Energy/Gas: Providing a deep data experience for the hydraulic fracking ecosystem
• Global Music & Entertainment: Used as the foundational method for the Open Music Initiative (OMI), open-music.org, which today has over 300 global members, including Spotify, Pandora, Warner Music, Universal Music Group, and many more. (http://open-music.org/blog/2016/9/2/dan-harples-pentalytic-framework-guides-the-omi)
• Blockchain/Mobility: Methods and structuring guidance to the co-founder of the "mobi" initiative while he was at Toyota; secured the name "mobi" and its branding for the consortium. (https://www.newswire.com/news/major-automakers-startups-technology-companies-and-others-launch-20456598)
• Climate: Building a global blockchain-enabled consortium to address climate change; advised and provided technology to the UN Global Climate Initiative, #hack4climate.
Investor:
As an experienced technology entrepreneur and inventor, Dan has participated at all stages of the investing lifecycle, from angel and venture rounds to initial public offerings, secondary offerings, PIPEs, and debt/bridge financings. He has been an LP in a variety of venture firms, including New Enterprise Associates (NEA), Charles River Ventures, Adams Capital Management, and Goldman Sachs. The more than 100 companies he has invested in include Webex, Juniper Networks, Verisign, Broadview Networks, Force 10 Networks, Vonage Holdings, Netezza, Tableau, Wercker (Oracle), and many others.
Non-Profit:
He has served as a director and/or advisor for a variety of nonprofits and educational institutions, including the Berklee College of Music, Stichting Nexuslabs Foundation, the International School of Amsterdam, Tabor Academy, the University of Rhode Island College of Engineering advisory board, Friends Academy, Harrisburg Academy, and as advisor to the president of Marlboro College.
He implemented a Pentalytic model for Impact Investing, recently developing an investment strategy and funding model with the Grantham Trust (Grantham Foundation for the Protection of the Environment), The Environmental Defense Fund (EDF), and other leading global NGOs and Foundations focused on climate and sustainability.
January 2018 saw the launch of a new Context Labs company, SphericalAnalytics.io, in partnership with the Grantham Trust, to provide the world’s trusted source for environmental data, utilizing key cryptographic and blockchain-enabled technologies.
Awards, Publications & Books:
He has received numerous awards, including Inc. Magazine’s Entrepreneur of the Year Award and the NEA (New Enterprise Associates) President’s Award.

The book by the Wall Street Journal's Thomas Petzinger, "The New Pioneers: The Men and Women Who Are Transforming the Workplace and Marketplace" (Simon & Schuster), describes the pioneering work done by Dan and his team on voice over IP (VoIP), real-time collaboration, and Internet video streaming. Two Wall Street Journal profiles also appeared in the mid-'90s.
Michael Casey's recent bestseller (April 2018), "The Truth Machine: The Blockchain and the Future of Everything," describes the work Dan is leading in the global blockchain market on data veracity and on efforts to provide industry-wide APIs enabling interoperability in complex ecosystems such as automotive and supply chain.
He co-authored, with Internet pioneer Vint Cerf, the book "Disrupting Unemployment," focusing on technology's impact on employment and the economy. He was featured in "Raising Can-Do Kids: Giving Children the Tools to Thrive in a Fast-Changing World" (Random House LLC), whose authors Richard Rende and Jen Prosek chose to open the book with chapter one, "Wired for Exploration," discussing Dan's views on raising children in our digital world.
He also has published in a variety of conference proceedings (ANSYS Conference) dealing with the application of CAE/CAD/CAM and finite element analysis (FEA) in distributed computing environments, specifically in the field of mechanical design and ergonomics.
Education:
He holds degrees from MIT (MSc) and the University of Rhode Island (mechanical engineering and psychology), and he also attended Marlboro College. He is currently on the board of trustees of the Berklee College of Music.
Patents:
Several of the innovations developed by Dan and his teams serve as core foundational technical innovations in how the Internet has developed from the mid-1990s to today. The first patent (Apparatus for Collaborative Computing) is typically cited by other patents in the areas of VoIP, streaming, and shared virtual workspaces, and has been assigned over time to Netscape, AOL, Microsoft/Skype, and most recently, Facebook.

Presentations

Architectural Principles For Building Trusted, Real Time, Distributed IoT Systems Session

Data analysts, engineers, and scientists are rapidly becoming central to IoT design, development, and deployment. As more enterprises learn about the value creation potential of trusted, contextualized machine data used across multiple applications and entities, data professionals are becoming key partners for the operations, IT, and risk professionals who rely on these systems.

Kenji Hayashida is a data engineer at Recruit Lifestyle Co., Ltd. in Japan. He holds a master's degree in information engineering from Osaka University.

Kenji started his career as a software engineer at HITECLAB while he was in college. Since joining Recruit Group, Kenji has been involved in many projects, including advertising technology, content marketing, and data pipelines.

In his spare time, Kenji enjoys programming competitions such as TopCoder, Google Code Jam, and Kaggle. He is also a writer of a popular data science textbook.

Presentations

Best Practices to Develop an Enterprise Datahub to Collect and Analyze 1TB/Day of Data from Many Services with Apache Kafka and Google Cloud Platform in Production Session

Recruit Group and NTT DATA Corporation developed a platform based on a "datahub" utilizing Apache Kafka. The platform handles around 1TB/day of application logs generated by many services in Recruit Group. This session explains some of the best practices and know-how learned during the project, such as schema evolution and network architecture.

Jeff is Trifacta's chief experience officer and co-founder, as well as a professor of computer science at the University of Washington, where he directs the Interactive Data Lab. Jeff's passion is the design of novel user interfaces for exploring, managing, and communicating data. The data visualization tools developed by his lab (D3.js, Protovis, Prefuse) are used by thousands of data enthusiasts around the world. In 2009, Jeff was named to MIT Technology Review's list of "Top Innovators under 35."

Presentations

The Vega Project: Building an Ecosystem of Tools for Interactive Visualization Session

This session introduces Vega and Vega-Lite, high-level declarative languages for interactive visualization that support exploratory data analysis, communication, and the development of new visualization tools.

Sam Helmich is a data scientist in John Deere's Intelligent Solutions Group. He has worked in applied analytics roles within John Deere Worldwide Parts and Global Order Fulfillment and holds an MS in statistics from Iowa State University.

Presentations

Data Science in an Agile Environment: Methods and Organization for Success Data Case Studies

Data science can benefit from borrowing some principles of Agile. These benefits can be compounded by structuring team roles in a manner that enables success without relying on full-stack expert "unicorns."

Alex is a software engineer in the analytics group at Cray, focused on deep learning technologies. His team develops applications that let HPC users readily incorporate data analytics and machine learning tools into their workflows.

Presentations

A Deep Learning Approach for Precipitation Nowcasting with RNN using BigDL on Spark Session

Precipitation nowcasting is used to predict the future rainfall intensity over a relatively short timeframe. The forecasting resolution and time accuracy required are much higher than other traditional forecasting tasks. We will talk about building a precipitation nowcasting system with recurrent neural networks using BigDL on Apache Spark.

Camila Hiskey is a senior systems engineer at Cloudera. A hands-on technologist, she architects enterprise data solutions, primarily for large financial services and life sciences organizations, and helps educate IT and business teams on Hadoop, open source software, and big data. Prior to Cloudera, she worked with operational data stores and analytical databases at IBM as an engineer and DBA.

Presentations

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

Protecting sensitive data in huge datasets: Cloud tools you can use Session

Before releasing a public dataset, practitioners need to strike a balance between utility and the protection of individuals. In this talk we'll move from theory to real life while handling massive public datasets, showcasing newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity into the practical realm.
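For a concrete sense of one of these concepts, here is a minimal pure-Python check for k-anonymity; the records and column names below are hypothetical:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """A dataset is k-anonymous if every combination of quasi-identifier
    values is shared by at least k records."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical records: generalized zip code and age bracket are the quasi-identifiers.
records = [
    {"zip": "100**", "age": "30-40", "diagnosis": "flu"},
    {"zip": "100**", "age": "30-40", "diagnosis": "cold"},
    {"zip": "102**", "age": "20-30", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "age"], k=2))  # False: the last record is unique
```

Production tools generalize or suppress values until such checks pass; this sketch only tests the property.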

Garrett Hoffman is a senior data scientist at StockTwits, where he leads efforts to use data science and machine learning to understand social dynamics and develop research and discovery tools used by a network of over one million investors. Garrett has a technical background in math and computer science but gets most excited about approaching data problems from a people-first perspective: using what we know or can learn about complex systems to drive optimal decisions, experiences, and outcomes.

Presentations

Deep Learning Methods for Natural Language Processing Tutorial

This workshop reviews deep learning methods used for natural language processing and natural language understanding tasks while working through a live example on StockTwits data using Python and TensorFlow. Methods covered include Word2Vec, recurrent neural networks and variants (LSTM, GRU), and convolutional neural networks.
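For background on the Word2Vec portion, skip-gram models are trained on (center word, context word) pairs drawn from a window around each token; a minimal sketch of that pair extraction (the sentence and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Extract (center, context) training pairs as used by Word2Vec's skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every other token within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("bullish on this stock".split(), window=1)
print(pairs[0])  # ('bullish', 'on')
```

A real Word2Vec implementation feeds such pairs into a shallow network that learns dense word embeddings.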

Anthony is a senior software engineer working on the Data Management team at LinkedIn, where he works on LinkedIn’s data access layer, Dali, and has contributed to Apache Hive and Pig. He holds a B.S. in Computer Science from Yale University.

Presentations

Enforcing GDPR Compliance at Scale Session

With over 100 million LinkedIn members in the EU, enforcing GDPR compliance is challenging. In this talk, we explain the architecture of our system and how we leverage Hive, Kafka, Gobblin, and WhereHows to ensure compliance.

Keqiu is a staff software engineer at LinkedIn. He previously worked on mobile infrastructure at LinkedIn and recently moved to big data platforms.

Presentations

TonY -- Native support of TensorFlow on Hadoop Session

We have developed TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. TonY's native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop including MapReduce and Spark.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.

Presentations

Why and how to leverage the power and simplicity of SQL on Apache Flink Session

Fabian Hueske discusses why SQL is a great approach to unify batch and stream processing. He gives an update on Apache Flink's SQL support and shares some interesting use cases from large-scale production deployments. Finally, Fabian presents Flink's new query service that enables users and applications to submit streaming and batch SQL queries and retrieve low-latency updated results.

Data Science and Machine Learning consultant at Microsoft. Previously Machine Learning student at Cambridge, Engineering student in Ghent.

Presentations

Democratising deep learning with transfer learning Session

Transfer learning allows data scientists to leverage insights from large labelled data sets. The general idea of transfer learning is to use knowledge learned from tasks for which a lot of labelled data is available in settings where only little labelled data is available. In this talk, you’ll learn what transfer learning is and how it can boost your NLP or CV pipelines.

Sr. Software Engineer at LinkedIn on the Hadoop development team.

Presentations

TonY -- Native support of TensorFlow on Hadoop Session

We have developed TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. TonY's native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop including MapReduce and Spark.

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions of Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Scalable Machine Learning for Data Cleaning Session

Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.

Maryam is a research scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She obtained her PhD from the Icahn School of Medicine at Mount Sinai (New York) for her studies on molecular regulators of organ size control. Maryam's long-term research goal is to reduce bias in decision-making using a combination of computational linguistics, machine learning, and behavioral economics methods.

Presentations

‘Moneyballing’ Recruiting: A Data-Driven Approach to Battling Bottlenecks and Biases in Hiring Data Case Studies

Hiring teams have long relied on intuition and experience to scout talent. Increased data and data science techniques give us a chance to test common recruiting wisdom. Maryam draws on results from her recent behavioral experiments and analyses of over 10 million jobs and their outcomes to illustrate how seemingly innocuous recruiting decisions can have dramatic impacts on hiring outcomes.

Ankit is a data scientist at Uber, where his primary focus is forecasting using deep learning methods and business problems related to self-driving cars. Previously, he worked in a variety of data science roles at Runnr, Facebook, Bank of America, and ClearSlide. Ankit holds a master’s degree from UC Berkeley and a BS from IIT Bombay (India).

Presentations

Achieving Personalization with LSTMs Session

Personalization is a common theme in social networks and ecommerce businesses. However, personalization at Uber involves understanding how each driver and rider is expected to behave on the platform. In this talk, we focus on how deep learning (LSTMs) and Uber's huge database can be used to understand and predict the future behavior of each and every user on the platform.

Jeroen Janssens is the founder and CEO of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Presentations

50 reasons to learn the shell for doing data science Session

"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.

Data Science with Unix Power Tools Tutorial

The Unix command line remains an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools, you can quickly scrub, explore, and model your data as well as hack together prototypes. This hands-on workshop is based on the O’Reilly book Data Science at the Command Line, written by instructor Jeroen Janssens.

Theresa Johnson is a product manager for metrics and forecasting products at Airbnb. As a data scientist, she served on the task force and cross-functional hackathon team at Airbnb that developed the framework for the company's current antidiscrimination efforts. Theresa joined Airbnb after earning a PhD in aeronautics and astronautics from Stanford University.

She is a founding board member of Street Code Academy, a nonprofit dedicated to high-touch technical training for inner-city youth, and has been featured in TechCrunch for her commitment to helping early-stage founders raise capital. Her lifelong fascination with the capacity of technology to change lives led her to Stanford University, where she earned dual undergraduate degrees in Science, Technology and Society and Computer Science. Theresa is passionate about extending technology access to everyone and finding mission-driven companies that can have an outsized impact on the world.

Presentations

Revenue Forecasting Platform at Airbnb Findata

How Airbnb is building its next-generation end-to-end revenue forecasting platform, leveraging machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology.

Omkar is a software engineer on Uber’s Hadoop platform team and is currently architecting Marmaray. He has a keen interest in solving large-scale distributed systems problems. Previously, Omkar led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.

Presentations

Marmaray – A generic, scalable, and pluggable Hadoop data ingestion & dispersal framework Session

Marmaray is a generic Hadoop ingestion and dispersal framework recently released to production at Uber. We will introduce the main features of Marmaray and the business needs it meets, share how Marmaray can serve a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores, and give a deep dive into the architecture to show you how it all works.

All-day-long-coding CTO at Attendify. Clojure, Haskell, Rust. Fields of interest: algebras and protocols. Author of the Muse and Fn.py libraries. An active contributor to Aleph and other open source projects.

Presentations

Managing Data Chaos in the World of Microservices Session

When we talk about microservices, we usually focus on the communication layer and rarely on the data. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity: structural and semantic changes, knowledge sharing, and data discovery. We'll discuss emerging technologies created to tackle these challenges.

Atul Kale is a software engineer on Airbnb’s machine learning infrastructure team. He majored in computer engineering at the University of Illinois Urbana-Champaign. Prior to joining Airbnb, he worked in finance, building and deploying machine learning-driven proprietary trading strategies as well as the data pipelines to support them.

Presentations

Bighead: Airbnb's End-to-End Machine Learning Platform Session

We introduce Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Bighead integrates popular libraries including TensorFlow, XGBoost, and PyTorch. It is built on Python, Spark, and Kubernetes and is designed to be used in modular pieces. It has reduced overall model development time at Airbnb from many months to days.

Daniel Kang is a PhD student in the Stanford InfoLab, where he is supervised by Peter Bailis and Matei Zaharia. Daniel’s research interests lie broadly at the intersection of machine learning and systems. Currently, he is working on deep learning applied to video analysis.

Presentations

BlazeIt: An Exploratory Video Analytics Engine Session

As video volumes grow, automatic methods are required to prioritize human attention. However, these methods do not scale and are cumbersome to deploy. In response, we introduce BlazeIt, an exploratory video analytics engine. We show our declarative language, FrameQL, can capture a range of real-world queries and BlazeIt's optimizer can execute these queries over 2000x faster than naive approaches.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.

Yasuyuki Kataoka is a data scientist at NTT Innovation Institute Inc. His primary interest is applied R&D in machine learning applications for time series and heterogeneous data such as vision, audio, text, and IoT sensor signals. This data science work spans various fields, including automotive, sports, healthcare, and social media. Other areas of interest include robotics control, such as self-driving car and drone systems. When not doing research, he likes to participate in hackathons, where he has won prizes in the automotive and healthcare industries. He earned his MS and BS in mechanical and system engineering from the Tokyo Institute of Technology with valedictorian honors. Alongside his full-time work, he is a PhD candidate in artificial intelligence at the University of Tokyo.

Presentations

Real-time machine intelligence in IndyCar and Tour de France Session

One of the challenges of sports data analytics is how to deliver machine intelligence beyond a mere real-time monitoring tool. This session highlights various real-time machine learning models used in both IndyCar and the Tour de France, encompassing the real-time data processing architecture, the machine learning models, and demonstrations that deliver meaningful insights for players and fans.

Mubashir Kazia is a principal solutions architect at Cloudera and an SME in Apache Hadoop security in Cloudera’s Professional Services practice, where he helps customers secure their Hadoop clusters and comply with internal security policies. He also helps new customers transition to the Hadoop platform and implement their first few use cases, and he trains and mentors peers in Hadoop and Hadoop security. Mubashir has worked with customers from all verticals, including banking, manufacturing, healthcare, telecom, retail, and gaming. Previously, he worked on developing solutions for leading investment banking firms.

Presentations

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Correlation analysis on live data streams Session

Anomaly detection is a necessary but insufficient step: anomaly detection over a set of live data streams may result in anomaly fatigue, limiting effective decision making. We walk the audience through how marrying correlation analysis with anomaly detection can help, and we share techniques to guide effective decision making.
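The abstract above doesn't include code, but the core idea of correlating live streams can be sketched concretely. The following minimal pure-Python class (the name and API are illustrative, not from the speaker's materials) computes a Pearson correlation over two paired streams in a single pass, keeping only running sums:

```python
import math

class StreamingCorrelation:
    """Single-pass Pearson correlation of two paired streams.

    Keeps only running sums (count, sums, sums of squares, cross sum),
    so memory stays O(1) no matter how long the streams run.
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def correlation(self):
        if self.n < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / self.n
        var_x = self.sxx - self.sx * self.sx / self.n
        var_y = self.syy - self.sy * self.sy / self.n
        if var_x <= 0.0 or var_y <= 0.0:
            return 0.0  # a constant stream has no defined correlation
        return cov / math.sqrt(var_x * var_y)

# Two perfectly linearly related streams correlate at 1.0.
corr = StreamingCorrelation()
for x in range(100):
    corr.update(x, 2 * x + 1)
```

In a production pipeline one would typically use exponentially decaying sums or a sliding window rather than all-time totals, so the correlation tracks recent behavior instead of the entire history.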

Designing Modern Streaming Data Applications Tutorial

In this tutorial, we will walk the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, stream computing frameworks, and storage frameworks for real-time data. We will also walk through case studies from IoT, gaming, and healthcare and share our experiences operating these systems at internet scale.

Jawad Khan is director of data sciences and knowledge management at Rush University Medical Center. In this role, he leverages his extensive experience to lead Rush’s analytics and data strategy.

Jawad is passionate about leading the analytics program at Rush. He focuses on leveraging data from all sections of the business, including clinical, ERP, security, device sensor, and people/patient-generated data. In turn, the integrated data analytics will provide improved safety and better clinical outcomes, reduce cost, and drive innovation.

Jawad brings to Rush more than 20 years of experience in analytics, software development, data management, and data security. Prior to joining Rush, Jawad provided cloud enablement strategies for data and applications to clients like GE Capital, Coke, Procter & Gamble, and Warner Bros. while working as a lead architect at CenturyLink.

Prior to that, Jawad served as a Managing Director at Opus Capital Markets where he was responsible for leading analytics, data security and compliance, and software development. At Opus, Jawad was also responsible for data center and infrastructure development and operations.

Jawad graduated from Southern Illinois University with a degree in computer engineering and went on to work as a software engineering consultant for one of the Big Six consulting firms. He also speaks regularly at professional and community events and is a cricket commentator for WBEZ, the NPR-affiliate public radio station in Chicago.

Presentations

Digging for Gold: Developing AI in healthcare against unstructured text data - exploring the opportunities and challenges Session

This Cloudera/MetiStream solution lets healthcare providers automate the extraction, processing, and analysis of clinical notes within the electronic health record in batch or real time. By leveraging NLP capabilities to conduct fast analytics in a distributed environment, providers can improve care, identify errors, and recognize efficiencies in billing and diagnoses. Features a use case from Rush University Medical Center.

Amandeep Khurana is chief executive officer and cofounder at Cerebro Data, which he launched in 2016 with CTO and cofounder Nong Li. After witnessing firsthand the challenges companies faced in big data and cloud migration, he built Cerebro Data to empower all users with easy access through a unified, secured, and governed platform across heterogeneous data stores.

While supporting customer cloud initiatives at Cloudera and playing an integral role at AWS on the Elastic MapReduce team, Amandeep oversaw some of the industry’s largest big data implementations. As such, he understands that customers need self-serve analytics without trading away governance or security. Amandeep is the coauthor of HBase in Action, a book on building applications with HBase, and is passionate about distributed systems, big data, and everything cloud.

Amandeep received his MS in computer science from the University of California, Santa Cruz, and a bachelor’s in engineering from Thapar Institute of Engineering and Technology.

Presentations

The Move to a Modern Data Platform in the Cloud: Pitfalls to Avoid and Best Practices to Follow Session

Critical data management practices for easy and unified data access that meets security and regulatory compliance

James Kirkland is the advocate for Red Hat’s initiatives and solutions for the internet of things (IoT) and is the architect of Red Hat’s strategy for IoT deployments. This open source architecture combines data acquisition, integration, and rules activation with command and control data flows among devices, gateways, and the cloud to connect customers’ operational technology environments with information technology infrastructure and provide agile IoT integration. James serves as the head subject-matter expert and global team leader of system architects responsible for accelerating IoT implementations for customers worldwide. Through his collaboration with customers, partners, and systems integrators, Red Hat has grown its IoT ecosystem, expanding its presence in industries including transportation, logistics, and retail and accelerating adoption of IoT in large enterprises. James has deep knowledge of Unix and Linux variants spanning his 20-year career at Red Hat, Racemi, and Hewlett-Packard. He is a steering committee member of the IoT working group for Eclipse.org, a member of the IIC, and a frequent public speaker and author on a wide range of technical topics.

Presentations

Using Machine Learning to Drive Intelligence at the Edge Session

The focus on IoT is turning increasingly to the edge, and the way to make the edge more intelligent is to build machine learning models in the cloud and push those learnings back out to the edge. Join Cloudera and Red Hat as they showcase how they executed this architecture at one of Europe's leading manufacturers, including a demo highlighting the architecture.

Spencer Kirn is a PhD student in the Applied Science Department at the College of William & Mary. His research interests include applications of topic modeling, deep learning, and other AI methods to real-world problems, as well as understanding what happens inside these complicated algorithms in order to fully understand their predictions. Spencer graduated from the College of Wooster in 2016 with a BA in physics before arriving at William & Mary.

Presentations

An Intuitive Explanation for Approaching Topic Modeling Session

We compare topic modeling algorithms used to identify latent topics in large volumes of text data, then present coherence scores illustrating which method shows the highest consistency with human judgments of topic quality. We then discuss the importance of coherence scores in choosing the topic modeling algorithms that best support different use cases.
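As a rough, hedged illustration of the coherence scores mentioned above, here is a minimal pure-Python sketch of one common formulation, the UMass coherence measure; the function name and toy corpus are hypothetical, and production work would use a toolkit such as gensim rather than this sketch:

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass topic coherence for one topic's ranked top words.

    For each pair (w_i ranked above w_j), scores
    log((D(w_i, w_j) + 1) / D(w_i)), where D counts the documents
    containing all the given words. Scores are <= 0 in practice;
    less negative means more coherent. Assumes every word in
    top_words occurs in at least one document.
    """
    docs = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every one of the given words.
        return sum(1 for d in docs if all(w in d for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        wi, wj = top_words[i], top_words[j]
        score += math.log((doc_freq(wi, wj) + 1) / doc_freq(wi))
    return score

# Toy corpus: "cat" and "dog" co-occur, "cat" and "bird" never do.
docs = [["cat", "dog"], ["cat", "dog", "fish"], ["fish", "bird"]]
```

Words that frequently co-occur score higher than words that never appear together, which is why coherence tends to track human judgments of topic quality better than raw model likelihood does.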

Rita Ko is the director of The Hive, the innovation lab at the UN Refugee Agency in the United States (USA for UNHCR). She heads the application of machine learning and data science to explore new modes of engagement around the global refugee crisis. Her work in data science stems from her election campaign experience in Canada at the Office of the Mayor in the City of Vancouver, where she helped re-elect Mayor Gregor Robertson for three consecutive terms; she has worked on three national election campaigns applying predictive modeling. Rita earned her MBA from Cornell University.

Presentations

From Strategy to Implementation — Putting Data to Work at USA for UNHCR Session

The Hive and Cloudera Fast Forward Labs share how they transformed USA for UNHCR (UN Refugee Agency) to use data science and machine learning (DS/ML) to address the refugee crisis. From identifying use cases and success metrics to showcasing the value of DS/ML, we cover the development and implementation of a DS/ML strategy, hoping to inspire other organizations looking to derive value from data.

Andreas Kohlmaier joined Munich Re in 2008 as an IT architect. He currently heads the data engineering team in Munich, which is setting up the group-wide data lake and supporting Munich Re's transformation into a data-driven organization. He holds a master’s in computer science and has more than 15 years of experience in IT and data projects. His main areas of expertise are microservices, data management, IT architecture, and agile project management.

Presentations

Cataloging the Data Lake for Distributed Analytics Innovation at MunichRe Findata

MunichRe is increasing client resilience against economic, political, and cyber risks while setting and shaping trends in the insurance market. Recently, MunichRe successfully launched a data catalog as the driver of analyst adoption of its data lake. Cataloging new data encouraged users to effectively and collaboratively explore new ideas, develop new business, and enhance customer service.

As Google Cloud’s Chief Decision Scientist, Cassie Kozyrkov is passionate about helping everyone – Google, its customers, the world! – make better decisions through harnessing the beauty and power of data. She speaks at conferences and meets with leadership teams to empower decision-makers to transform their industries through AI, machine learning, and analytics.

At Google, Cassie has advised more than a hundred teams on statistics and machine learning, working most closely with Research & Machine Intelligence, Google Maps, and Ads & Commerce. She has also personally trained more than 15,000 Googlers (executives, engineers, scientists, and even non-technical staff members) in machine learning, statistics, and data-driven decision-making.

Prior to joining Google, Cassie spent a decade working as a data scientist and consultant. She is a leading expert in decision science, with undergraduate studies in statistics and economics (University of Chicago) and graduate studies in statistics, neuroscience, and psychology (Duke University and NCSU).

When she’s not working, you’re most likely to find Cassie at the theatre, in an art museum, exploring the world, playing board games, or curled up with a good novel.

Presentations

Executive Briefing: Most Data-Driven Cultures, Aren’t Session

Many organizations aren’t aware that they have a blindspot with respect to their lack of data effectiveness and hiring experts doesn’t seem to help. This session examines what it takes to build a truly data-driven organizational culture and highlights a vital, yet often neglected, job function: the data science manager.

Jay Kreps is the co-founder and CEO of Confluent, the company behind the popular Apache Kafka streaming platform. Previously, Jay was one of the primary architects for LinkedIn, where he focused on data infrastructure and data-driven products. He was among the original authors of a number of open source projects in the scalable data systems space, including Voldemort (a key-value store), Azkaban, Kafka (a distributed streaming platform), and Samza (a stream processing system).

Presentations

Apache Kafka and the four challenges of production machine learning systems Session

Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes, or customer experience. This talk explains some of the difficulties of building production machine learning systems and shows how Apache Kafka and stream processing can help.

A senior BI architect in the information technology industry since January 1995, with extensive experience in systems analysis, architecture, and administration of BI tools and data warehousing applications.

He has handled major telecom, banking, sales and finance, and HCM projects and has over nine years of MIS and business information reporting experience. He has strong architectural and technical knowledge of BI tools such as BI/BW, BusinessObjects, Cognos, Informatica, and Crystal Reports, along with major databases including Oracle, Sybase, MSSQL, Teradata, and Access. He also has hands-on experience with client-server technology (Visual Basic, PowerBuilder, Developer/2000, FoxPro) and web technologies such as XML, HTML, ASP, and JScript, and he has handled on-site/offshore projects and led offshore resources.

Through varied data analysis, data modeling, and customized business solutions, he has shown outstanding interpersonal and communication skills and has been a worthy contributor to project teams. He has the drive and determination to succeed and has proved valuable on many projects that required strong analysis skills and rapid acquisition of new technologies and application details to meet business needs.

Most Recent Significant Accomplishments

- SAP HANA and Cloudera Hadoop integration for big data analysis and mining
- Analysis platform delivery for supply chain management, production operations, demand planning, and sustainment
- On-schedule, on-budget delivery of BI solutions for enterprise data warehouses

Presentations

Self-Service Modern Analytics on the GovCloud Session

Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from our information assets, we’re constantly exploring ways to enable effective self-service scenarios. To address these challenges, we are building an analytics platform with a focus on leveraging AWS GovCloud. Join us to understand our journey into modern analytics.

Abhishek Kumar is a manager of data science in Sapient’s Bangalore office, where he looks after scaling up the data science practice by applying machine learning and deep learning techniques to domains such as retail, ecommerce, marketing, and operations. Abhishek is an experienced data science professional and technical team lead specializing in building and managing data products from conceptualization to deployment phase and interested in solving challenging machine learning problems. Previously, he worked in the R&D center for the largest power-generation company in India on various machine learning projects involving predictive modeling, forecasting, optimization, and anomaly detection and led the center’s data science team in the development and deployment of data science-related projects in several thermal and solar power plant sites. Abhishek is a technical writer and blogger as well as a Pluralsight author and has created several data science courses. He is also a regular speaker at various national and international conferences and universities. Abhishek holds a master’s degree in information and data science from the University of California, Berkeley.

Presentations

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.

Manoj Kumar is a senior software engineer on the data team at LinkedIn, where he is currently working on autotuning Hadoop jobs. He has more than four years of experience with big data technologies such as Hadoop, MapReduce, Spark, HBase, Pig, Hive, Kafka, and Gobblin. Before joining LinkedIn, he worked at PubMatic for more than four years on a data framework for slicing and dicing advertising data (30 dimensions, 50 metrics) that receives more than 20 TB of data every day. Before that, he worked at Amazon for more than 18 months.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping Session

Have you ever tuned a Spark or MapReduce job? If so, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. With Dr. Elephant, we introduced heuristic-based tuning recommendations. Now we introduce TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.

Data Scientist at H2O.ai

Presentations

Practical Techniques for Interpreting Machine Learning Models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
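The tutorial above works with tools such as H2O and XGBoost; as a library-agnostic sketch of one interpretation technique commonly used in this setting, here is a one-dimensional partial dependence computation that works with any prediction function. The function and variable names are illustrative, not taken from the tutorial's materials:

```python
def partial_dependence(predict, rows, feature_index, grid):
    """One-dimensional partial dependence of a model on one feature.

    For each candidate value in `grid`, overwrite that feature in every
    row and average the model's predictions. Plotting grid values against
    the returned curve shows how the model responds to that feature alone,
    averaged over the joint distribution of the other features.
    """
    curve = []
    for value in grid:
        modified = [row[:feature_index] + [value] + row[feature_index + 1:]
                    for row in rows]
        curve.append(sum(predict(r) for r in modified) / len(modified))
    return curve

# Toy model for demonstration: prediction = 2 * feature0 + feature1.
model = lambda r: 2 * r[0] + r[1]
rows = [[0.0, 1.0], [0.0, 2.0]]
curve = partial_dependence(model, rows, 0, [0.0, 1.0, 2.0])
```

Because the technique only needs a `predict` callable, the same idea applies to an otherwise opaque gradient-boosted or deep model, which is what makes it useful for the transparency goals the tutorial describes.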

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. He specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. In addition to his client-facing consulting and training, Jared is an adjunct professor of statistics at Columbia University and the organizer of the New York Open Statistical Programming Meetup and the New York R Conference. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world and was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master’s degree in statistics from Columbia University and a bachelor’s degree in mathematics from Muhlenberg College.

Presentations

Modeling Time Series in R Session

Temporal data is being produced in ever greater quantities, and fortunately, our time series capabilities are growing to match. We look at a number of techniques for modeling time series, starting with traditional methods such as ARMA, then moving to more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way, we look at a bit of the theory and code for training these models.
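The session itself is in R; as a language-agnostic illustration of the classical end of the spectrum it covers, here is a minimal pure-Python sketch of fitting an AR(1) model, the simplest member of the ARMA family, by ordinary least squares. The function names and toy series are illustrative, not from the session's materials:

```python
def fit_ar1(series):
    """Fit x_t = c + phi * x_(t-1) by ordinary least squares (closed form)."""
    x, y = series[:-1], series[1:]  # lagged values predict the next value
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    phi = (sum((a - mx) * (b - my) for a, b in zip(x, y))
           / sum((a - mx) ** 2 for a in x))
    c = my - phi * mx
    return c, phi

def forecast_ar1(series, steps, c, phi):
    """Iterate the fitted recurrence forward to produce point forecasts."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

# A toy series generated exactly by x_(t+1) = 2 + 0.5 * x_t.
series = [0.0]
for _ in range(20):
    series.append(2.0 + 0.5 * series[-1])
c, phi = fit_ar1(series)
```

In practice one would reach for a library (statsmodels in Python, or the forecast and prophet packages in R) that handles higher orders, seasonality, and uncertainty intervals, but the fit-then-iterate structure is the same.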

Paul Lashmet is practice lead and advisor for financial services at Arcadia Data, a company that provides visual big data analytics software that empowers business users to glean meaningful and real-time business insights from high volume and varied data in a timely, secure, and collaborative way. Paul writes extensively about the practical applications of emerging and innovative technologies to regulatory compliance. Previously, he led programs at HSBC, Deutsche Bank, and Fannie Mae.

Presentations

Visualize AI to Spot New Trading Opportunities Findata

Artificial intelligence and deep learning are used to generate and execute trading strategies today. Meanwhile, regulators and investors demand transparency into investment decisions. The challenge is that the decision-making processes of machine learning technologies are opaque. The opportunity is that these same machines generate data that can be visualized to spot new trading opportunities.

Francesca Lazzeri is a data scientist at Microsoft, where she is part of the algorithms and data science team. Francesca is passionate about innovations in big data technologies and the applications of advanced analytics to real-world problems. Her work focuses on the deployment of machine learning algorithms and web service-based solutions to solve real business problems for customers in the energy, retail, and HR analytics sectors. Previously, she was a research fellow in business economics at Harvard Business School. She holds a PhD in innovation management.

Presentations

A Day in the Life of a Data Scientist: how do we train our teams to get started with AI? Session

What profession did Harvard Business Review call the sexiest job of the 21st century? With the growing buzz around data science, several professionals have approached us at various events to learn more about how to become a data scientist. This session aims to raise awareness of what it takes to become a data scientist and of how artificial intelligence solutions have started to reinvent businesses.

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC Member on Apache Pig, Apache Arrow and a few others. Julien is a Principal Engineer at WeWork and was previously Architect at Dremio and tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

From flat files to deconstructed database: the evolution and future of the Big Data ecosystem. Session

Over the past 10 years, big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem that is turning into a fully deconstructed, open source database with reusable components. We started with a system that was good at looking for a needle in a haystack using snowplows: a lot of horsepower and scalability, but lacking the efficiency of relational databases.

Danielle helps clients approach, design, implement, and integrate new insights and advanced analytics data products that align with their business goals. She’s passionate about keeping data in context and applying research methods, best practices, and academic algorithms to industry business needs. With a strong background in machine learning, Danielle identifies the math, visualizations, and business questions and processes necessary to create reliable predictive models and, ultimately, good, data-driven business guidance. Danielle has worked in healthcare, academia, government, retail, gaming, and energy, and with quantified selfers, biohackers, hacklabs, and makerspaces. She is notoriously unreadable to GSR wearables. In her previous life, Danielle worked with the world’s most sophisticated wearable to date, the hearing aid. Currently, she focuses most of her time on data science in the energy sector.

Presentations

From Theory to Data Product - Applying Data Science Methods to Effect Business Change Tutorial

This tutorial is a primer on crafting well-conceived data science projects on course toward uncovering valuable business insights. Using case studies and hands-on skills development, we will teach techniques that are essential for a variety of audiences invested in effecting real business change.

Bob Levy is CEO of Virtual Cove, Inc., where his team leverages virtual and augmented reality to help people see datasets in a completely new light. A veteran tech executive, Mr. Levy brings over two decades of product leadership experience, including roles at IBM, Harte Hanks, and MathWorks. He also served as founding president of the BPMA, a 6,000+ person industry group, in 2001.

Presentations

Augmented Reality: Going Beyond Plots in 3D Session

Augmented reality opens a completely new lens on your data, through which you can see and accomplish amazing things. Learn how simple Python scripts can leverage completely new plot types. See use cases revealing new insight into financial markets data, and explore new ways of seeing and interacting with data to shed light on, and build trust in, otherwise “black box” machine learning solutions.

Jennifer Lim is the data and integration enterprise architect for Cerner Corporation, a company focused on creating intelligent solutions for the health care industry. Jennifer is responsible for the strategic planning, leadership, facilitation, analysis, and design tasks associated with the integration of internal Cerner applications. Her areas of focus include data management and governance, data architecture, API life cycle management, and services architecture.

Jennifer has over 18 years of experience in the telecommunications, banking and federal, and healthcare IT industries. She has filled roles in data analysis, data architecture, and application design having both built and used data warehouses, data marts, operational data stores, data lakes, API management platforms, and even the occasional application database. Jennifer holds a BS in management information systems and an MBA in management.

Presentations

Modernizing Operational Architecture with Big Data — Creating and Implementing a Modern Data Strategy Data Case Studies

Big data expectations can no longer be technical requirements managed with bubble systems. They impact our entire architecture, including operational assets in areas like HR, finance, marketing, and service management. Share in the approaches we used to create our modern architecture strategy, realigning big data expectations with our business goals to increase our efficiency and innovation.

Chang Liu is an applied research scientist and a member of the Georgian Impact team. She brings her in-depth knowledge of mathematical and combinatorial optimization to helping Georgian’s portfolio companies. Prior to joining Georgian Partners, Chang was a risk analyst at Manulife Bank, where she built models to assess the bank’s risk exposure based on extensive market research, including evaluating and predicting the impact of the 2014 oil price drop on mortgage lending risk in Alberta.

Chang holds a Master’s of Applied Science in Operations Research from the University of Toronto, where she specialized in combinatorial optimization. She also holds a Bachelor’s Degree in mathematics from the University of Waterloo.

Presentations

Solving the Cold Start Problem: Data and Model Aggregation Using Differential Privacy Session

This talk outlines a common problem faced by many software companies, the cold-start problem, and how Georgian Partners has been successful at solving this problem by transferring knowledge from existing data through differentially private data aggregation.

Changshu Liu is a software engineer at Pinterest, where, as one of the founding members of the data engineering team, he built big data infrastructure such as workflow engines and query processing engines. Before joining Pinterest, he worked at Facebook and Microsoft Research Asia on search infrastructure and big data processing systems.

Presentations

Scaling Pinterest’s data lake in cloud using Hive, Presto and Spark SQL Session

At Pinterest, we build our data lake on Amazon S3, where we read and write data directly. The data lake’s footprint exceeds 100 PB as the business expands rapidly, making Pinterest one of the biggest customers on AWS. This massive scale and the nature of S3 bring a lot of technical challenges to processing engines like Hive, Presto, and Spark SQL.

Ingrid Liu is a senior software engineer (big data) at Novantas with an economics degree from Princeton University. Passionate about software and machine learning, she is a specialist in building Spark-based analytical platforms and enjoys innovating solutions in Fintech.

Presentations

Case Study : A Spark-based Distributed Simulation Optimization Architecture for Portfolio Optimization in Retail Banking Session

We discuss a large-scale optimization architecture in Spark for a consumer product portfolio optimization case study in retail banking—which combines a simulator that distributes computation of complex real-world scenarios given varying macro-economic factors, consumer behavior and competitor landscape, and a constraint optimizer that uses business rules as constraints to meet growth targets.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience and enjoys intelligent design and engaging storytelling. He is passionate about data, music, and nature.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

O'Reilly keynote Keynote

Details to come.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

I am a passionate advocate of artificial intelligence and how it will transform businesses. Over the past two and a half years, I have led the creation of Baringa’s data science and analytics team and supported our clients on their journeys to become leaders in artificial intelligence within their respective industries. Prior to this role, I worked as an independent data science consultant, in an investment bank, and for a leading Formula 1 team. I have two first-class master’s degrees in quantitative subjects and have published and patented a machine learning system.

Presentations

Predicting residential occupancy and hot water usage from high frequency, multi-vector utilities data Session

Future home energy management systems could improve their energy efficiency by predicting residents’ needs from utilities data. This session discusses the opportunity, with a particular focus on the key data features, the need for data compression, and the data quality challenges.

Joseph Lubin is a co-founder of blockchain computing platform Ethereum and the founder of Consensus Systems (ConsenSys), a blockchain venture studio. ConsenSys is one of the largest and fastest-growing companies in the blockchain technology space, building developer tools, decentralized applications, and solutions for enterprises and governments that harness the power of Ethereum. Headquartered in New York, ConsenSys also has a global presence, employing top entrepreneurs, computer scientists, software developers, and experts in enterprise delivery worldwide.

Lubin graduated from Princeton University with a degree in Electrical Engineering and Computer Science. He worked in the Princeton Robotics Lab, at tomandandy music developing an autonomous music composition tool, and at private research firm Vision Applications Inc. building autonomous mobile robots.

As a software engineer and consultant, Lubin worked with eMagine on the Identrus project and was involved in the founding and operation of a hedge fund with a partner. He held positions as Director of the New York office of Blacksmith Software Consulting, and VP of Technology in Private Wealth Management at Goldman Sachs. Through these posts, Lubin focused on the intersection of cryptography, engineering, and finance.

Switching gears, Lubin moved to Kingston, Jamaica to work on projects in the music industry. Two years into his musical endeavors, Lubin co-founded the Ethereum Project and has been working on Ethereum and ConsenSys since January 2014.

Presentations

Keynote with Joseph Lubin Keynote

Joseph Lubin, co-founder of blockchain computing platform Ethereum and founder of Consensus Systems (ConsenSys)

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Over his career, he has been responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams Tutorial

This hands-on tutorial builds streaming apps as microservices using Kafka with Akka Streams and Kafka Streams. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll be better informed when choosing tools for your projects. We'll contrast them with Spark Streaming and Flink, including when to choose them instead. The sample apps demonstrate ML model serving ideas.

Gerard Maas is a senior software engineer at Lightbend, where he contributes to the Fast Data Platform and focuses on the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He is the coauthor of Stream Processing with Apache Spark from O’Reilly. Gerard is a frequent speaker and contributes to small and large open source projects. In his free time, he tinkers with drones and builds personal IoT projects.

Presentations

Processing Fast Data with Apache Spark: The Tale of Two APIs Session

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. We will provide a critical view of their differences in key aspects of a streaming application: API usability, dealing with time, dealing with state and machine learning capabilities. We will round up with practical guidance on picking one or combining both to implement resilient streaming pipelines.

Swetha Machanavajhala is a software engineer on Azure Networking, where she builds tools to help engineers detect and diagnose network issues within seconds. She is very passionate about building products and awareness for people with disabilities; this passion has led her to drive several projects at hackathons, winning multiple awards, and to take those projects from idea to reality to launch as beta products. She is also a co-lead of the disability Employee Resource Group, where she represents the community of people who are deaf or hard of hearing, and is part of the ERG chair committee. In addition, Swetha is a public speaker and has given several talks internally at Microsoft as well as at external events.

Presentations

Deep Learning on audio in Azure to detect sounds in real-time Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds (dog barks, alarms, people calling from behind, etc.). While we all take this for granted, there are over 360 million people in this world who are deaf or hard of hearing. How can we make the auditory world inclusive, and meet the great demand in other sectors, by applying deep learning to audio in Azure?

Mark Madsen is the global head of architecture at Think Big Analytics, where he is responsible for understanding, forecasting, and defining the analytics landscape and architecture. Prior to this, he was CEO of Third Nature, where he advised companies on data strategy and technology planning. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. We will explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and HearthStone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the Internet of Things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Executive Briefing: Managing successful data projects - technology selection and team building Session

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. In this session we'll provide guidance and best practices to help technical leaders deliver successful projects from planning to implementation.

Shankar has 17+ years of experience building distributed systems and productivity tools. He started out building a highly successful distributed test automation system for Windows and Bing at Microsoft. He then spent eight years helping build a middle-tier platform that powered most of the online services forming the backbone of Bing and Microsoft Ads. He currently leads the grid productivity team in Bangalore, empowering Hadoop developers at LinkedIn to be more productive with their time and cluster resources.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping Session

Have you ever tuned a Spark or MapReduce job? If so, you already know how difficult it is to tune the more than one hundred parameters that govern resource usage. With Dr. Elephant, we introduced heuristic-based tuning recommendations. Now we introduce TuneIn, an autotuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.

Baolong Mao is a big data platform development engineer at JD.com, where he works on the company’s big data platform and focuses on the big data ecosystem. He is an open source developer, an Alluxio PMC member and contributor, and a Hadoop contributor. He’s a fan of technology sharing and open source.

Presentations

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Session

Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average.

Jaya Mathew is a senior data scientist on the artificial intelligence and research team at Microsoft, where she focuses on the deployment of AI and ML solutions to solve real business problems for customers in multiple domains. Previously, she worked on analytics and machine learning at Nokia and Hewlett-Packard. Jaya holds an undergraduate degree in mathematics and a graduate degree in statistics from the University of Texas at Austin.

Presentations

A Day in the Life of a Data Scientist: how do we train our teams to get started with AI? Session

What profession did Harvard Business Review call the sexiest job of the 21st century? With the growing buzz around data science, many professionals have approached us at various events to learn more about how to become a data scientist. This session aims to raise awareness of what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.

Les McMonagle (CISSP, CISA, ITIL) is vice president of security strategy at BlueTalon Inc.

Les has over twenty years’ experience in information security. He has served as chief information security officer (CISO) for a credit card company and an ILC bank, founded a computer training and IT outsourcing company in Europe, directed the security and network technology practice for Cambridge Technology Partners across Europe, and helped several security technology firms develop their initial product strategies. Les founded and managed Teradata’s information security, data privacy, and regulatory compliance center of excellence and was chief security strategist at Protegrity before taking his current role at BlueTalon.

Les holds a BS in MIS along with CISSP, CISA, ITIL, and other relevant industry certifications.

Presentations

Privacy by Design – Building data privacy and protection in, versus bolting it on later Session

"Privacy by Design" is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. This session will outline how organizations can save time and money while improving data security and regulatory compliance, dramatically reducing the risk of a data breach or expensive penalties for noncompliance.

Matteo Merli is a software engineer at Streamlio working on messaging and storage technologies. Previously, he spent several years at Yahoo building database replication systems and multitenant messaging platforms. Matteo was the architect and lead developer for Yahoo Pulsar and a member of the PMC of Apache BookKeeper.

Presentations

High Performance Messaging with Apache Pulsar Session

Apache Pulsar, a messaging system, is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it is very important to ensure that the system can make use of all the available resources. This talk will provide insight into the design decisions and implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.

Cory Minton is a staff technologist on Dell EMC’s Ready Solutions team, where he works hand in hand with clients across the globe to assess and develop big data strategies, architect technology solutions, and ensure successful deployments of these transformational initiatives. A geek, technology evangelist, and business strategist, Cory is focused on finding creative ways for organizations to drive the utmost value from their data while transforming IT’s relevance to the organizations and customers it serves. With a diverse background in IT applications, consulting, data center infrastructure, and the expanding big data ecosystem, Cory brings an interesting perspective to the clients he serves while consistently challenging them to think bigger. Cory holds an undergraduate degree in engineering from Texas A&M University and an MBA from Tennessee Tech University. He resides in Birmingham, Alabama, with his beautiful wife and two awesome children.

Presentations

DIY vs. designer approaches to deploying data center infrastructure for machine learning and analytics Session

How to choose the right deployment model for on-premises infrastructure to reduce risk, cut costs, and be more nimble.

Mridul Mishra is responsible for emerging technology in the asset management group at Fidelity Investments, where he leads machine learning and AI projects and has been involved in putting these projects into production. Mridul has around 21 years of experience building enterprise software, ranging from core trading software to smart applications using AI/ML capabilities.

Presentations

Explainable Artificial Intelligence (XAI): Why, When, and How? Findata

Currently, most ML models (deep learning models in particular) work like a black box, and a key challenge to their adoption is the need for explainability. In this talk, we will explore the reasons explainability is needed, survey the current state of the field, and provide a framework for thinking about these needs and the potential solution options.

Nina Mishra is principal scientist at Amazon Web Services, where she focuses on data science, data mining, web search, machine learning, and privacy. Nina has many years of experience leading projects at Amazon, Microsoft Research, and HP Labs. She was also an associate professor at the University of Virginia and an acting faculty member at Stanford University. Nina’s research encompasses the design and evaluation of new data mining algorithms on real, colossal-sized datasets. She has authored almost 50 publications in top venues, including WWW, WSDM, SIGIR, ICML, NIPS, AAAI, COLT, VLDB, PODS, CRYPTO, EUROCRYPT, FOCS, and SODA, which have been recognized with best paper award nominations. Nina’s research was central to the Bing search engine and has been widely featured in external press coverage. Nina holds 14 patents, with a dozen more still in the application stage. She has had the distinct privilege of helping others advance in their careers, including 15 summer interns and many full-time researchers. Nina’s service to the community includes serving on the editorial boards of Machine Learning, the Journal of Privacy and Confidentiality, IEEE Transactions on Knowledge and Data Engineering, and IEEE Intelligent Systems; chairing the premier machine learning conference, ICML, in 2003; and serving on numerous program committees for web search, data mining, and machine learning conferences. She was awarded an NSF grant as a principal investigator and has served on eight PhD dissertation committees.

Presentations

Continuous machine learning over streaming data, the story continues. Session

Understand how unsupervised learning can provide insights into streaming data, with new applications for imputing missing values, forecasting future values, detecting hotspots, and performing classification tasks, and how to implement these techniques efficiently so they operate in real time over massive data streams.

Sanjeev Mohan leads big data research at Gartner for technical professionals. His areas of expertise span end-to-end data pipelines, including ingestion, persistence, integration, transformation, and advanced analytics, and his research includes machine learning and IoT. He researches trends and technologies for relational and NoSQL databases as well as object stores and cloud databases. He is a well-respected speaker on big data and data governance and is on the panel of judges for several Hadoop distribution organizations, such as Cloudera and Hortonworks.

Presentations

Executive Briefing: Enhance your Data Lake with comprehensive Data Governance to improve adoption and meet compliance needs Session

If the last few years were spent proving the value of data lakes, the emphasis now is on monetizing big data architecture investments. The rallying cry is to onboard new workloads efficiently. But how does one do so without knowing what data is in the lake, the level of its quality, or the trustworthiness of the models? This is why data governance becomes the linchpin to the success of data lakes.

Mostafa Mokhtar is a performance engineer at Cloudera. Previously, he held similar roles at Hortonworks and on the SQL Server team at Microsoft.

Presentations

Optimizing Apache Impala for a cloud-based data warehouse Session

Cloud object stores are becoming the bedrock of a cloud data warehouse for modern data-driven enterprises. Given today's data sizes, it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. In this talk, we'll discuss the optimal end-to-end workflows and technical considerations of using Apache Impala over object stores for your cloud data warehouse.

Andrew Montalenti is the co-founder and CTO of Parse.ly, a widely-used real-time web content analytics platform. The product is trusted daily by editors at HuffPost, TIME, TechCrunch, Slate, Quartz, The Wall Street Journal, and over 350 other leading digital companies. Andrew is a dedicated Pythonista who has presented his team’s work at the PyCon and PyData conferences. He is also the co-host of the web data/analytics podcast, The Center of Attention. Follow him on Twitter via @amontalenti and check out Parse.ly’s research on Internet attention via @parsely.

Presentations

Applying petabyte-scale analytics and machine learning to billions of news reading sessions Session

What can we learn from a one-billion-person live poll of the internet? Parse.ly has gathered a unique dataset of news reading sessions from billions of devices, peaking at over 2 million sessions per minute on thousands of high-traffic news and information websites. Our team of data scientists and machine learning engineers has used this data to unearth the secrets behind online content.

Software Engineer / Lead Architect

Presentations

Self-Service Modern Analytics on the GovCloud Session

Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from our information assets, we’re constantly exploring ways to enable effective self-service scenarios. To address these challenges, we are building an analytics platform focused on leveraging AWS GovCloud. Join us to understand our journey into modern analytics.

The first time I met the word data it was just the plural of datum.

I am a BI architect at Zalando, where I am redesigning the current data infrastructure. I like to solve problems and learn new things.

I like to draw data models and optimize queries.

In my free time I have a daughter; for some reason she speaks four languages (but not my dialect).

Presentations

Scaling Data Infrastructure in the fashion world or “What is this? Business Intelligence for ants?” Session

The story of how Zalando went from old-school BI to an AI-driven company built on a solid data platform, what we learned in the process, and the challenges we still see ahead of us.

Ash Munshi is CEO of Pepperdata. Previously, Ash was executive chairman for deep learning startup Marianas Labs (acquired by Askin in 2015); CEO of big data storage startup Graphite Systems (acquired by EMC DSSD in 2015); CTO of Yahoo; and CEO of a number of other public and private companies. He serves on the board of several technology startups.

Presentations

Classifying Job Execution Using Deep Learning Session

In this talk, we will describe a technique for labeling applications using runtime measurements of CPU, memory, I/O, and network, together with a deep neural network. This labeling groups applications into buckets with understandable characteristics, which can then be used to reason about the cluster and its performance.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).

Presentations

Setting Up a Lightweight Distributed Caching Layer using Apache Arrow Session

This talk will deep-dive on a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. We'll start with an overview of the system design and deployment architecture. This includes coverage of cache lifecycle, update patterns, cache cohesion and appropriate use cases.

Niraj Nagrani joined Ancestry, the global leader in family history and consumer genomics, in September 2017 as senior vice president of data and cloud platform, where he is responsible for the company’s data and cloud platform. Prior to joining Ancestry, Niraj was global head of platforms for cloud, data, analytics, frameworks, products, and engineering at American Express, where he was responsible for the consumer, merchant, and corporate application, cloud, and data science platforms. Earlier, Niraj was VP of engineering and product at A16Z-funded SnapLogic and general manager of Microsoft Azure Cloud and O365 engineering at Microsoft, in addition to prior senior executive product and engineering leadership roles at Oracle, Interwoven, and Cap Gemini.

Presentations

Using Data Science & Technology to Deliver Personalized Insights from Genome at Scale: An Ancestry.com Case Study Session

Ancestry has more than 10 petabytes of structured and unstructured data. Ancestry’s SVP of data and cloud platform, Niraj Nagrani, will discuss how companies can build a data platform that uses cloud computing, data science, artificial intelligence, and machine learning to analyze complex datasets at scale, providing personalized insights and a relationship graph to consumers.

Paco Nathan leads the Learning Group at O’Reilly Media, and is the program co-chair for JupyterCon. Known as a “player/coach”, Paco led innovative data teams building ML apps at scale for several years and more recently was the developer evangelist for Apache Spark. His expertise is in machine learning, NLP, distributed systems, and cloud computing with 35+ years of tech industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

Executive Briefing: Best practices for Human-in-the-Loop - the business case for Active Learning Session

Deep learning works well when you have large labeled datasets, but not every team has those assets. *Active learning* is an ML variant that incorporates *human-in-the-loop* feedback. It focuses input from human experts, leverages intelligence already in the system, and provides systematic ways to explore and exploit "uncertainty" in your data. From this, a strategy emerges for managing teams of people plus automation.
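
The core active learning loop the abstract describes can be sketched in a few lines: rank unlabeled examples by model uncertainty and send only the most ambiguous ones to human experts. This is a minimal illustration of pool-based uncertainty sampling, not code from the talk; the function and variable names are illustrative.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling,
# assuming a binary classifier that returns P(label = 1) for an example.

def uncertainty(p):
    """Distance from a coin flip: 0 means maximally uncertain (p = 0.5)."""
    return abs(p - 0.5)

def pick_for_labeling(pool, predict_proba, budget=2):
    """Return the `budget` unlabeled items the model is least sure about.

    pool          -- list of unlabeled examples
    predict_proba -- callable mapping an example to P(label = 1)
    """
    ranked = sorted(pool, key=lambda x: uncertainty(predict_proba(x)))
    return ranked[:budget]  # these go to the human experts

# Toy "model": the probability for each item is simply stored in a dict.
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.48}
chosen = pick_for_labeling(list(scores), scores.get, budget=2)
print(chosen)  # -> ['b', 'd'] (the two items closest to 0.5)
```

In a real system, labels gathered this way are fed back into retraining, so each round of human effort is spent where the model is least confident.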

As the Director of Business Strategies for SAS Best Practices, Kimberly balances forward thinking with real-world perspectives on business analytics, data governance, analytic cultures, and change management.

Kimberly’s current focus is helping customers understand both the business potential and practical implications of artificial intelligence (AI) and machine learning (ML).

Presentations

Rationalizing Risk in AI/ML Session

Too often, the discussion of AI and ML includes an expectation, if not a requirement, of infallibility. But as we know, this expectation is not realistic. So what’s a company to do? While risk can’t be eliminated, it can be rationalized. This session will demonstrate how an unflinching risk assessment enables AI/ML adoption and deployment.

Ann Nguyen evangelizes design for impact at Whole Whale, where she leads the tech and design team in building meaningful digital products for nonprofits. She has designed and managed the execution of multiple websites for organizations including the LAMP, Opportunities for a Better Tomorrow, and Breakthrough. Ann is always challenging designs with A/B testing. She bets $1 on every experiment that she runs and to date has accumulated a decent sum. Previously, Ann worked with a wide range of organizations from the Ford Foundation to Bitly. She is Google Analytics and Optimizely Platform certified. Ann is a regular speaker on nonprofit design and strategy and recently presented at the DMA Nonprofit Conference. She has also taught at Sarah Lawrence College. Outside of work, Ann enjoys multisensory art, comedy shows, fitness, and making cocktails, ideally all together.

Presentations

How to Be Aggressively Tone-Deaf Using Data (or, We Should All Be For-Benefits) Data Case Studies

Google returns 97,900,000 results for “data-driven business.” Innovation is the key to survival, and data, combined with design thinking and iteration, is a proven path. The problem is that this system lacks a conscience; it lacks empathetic thinking.

Minh Chau Nguyen is a researcher in the Big Data Software Platform Research department at the Electronic and Telecommunications Research Institute (ETRI), one of the largest government-funded research institutes in Korea. His research interests include big data management, software architecture, and distributed systems.

Presentations

A Data Marketplace Case Study with Blockchain and Advanced Multitenant Hadoop in a Smart Open Data Platform Session

This session will address how analytics services in data marketplace systems can run on a single Hadoop cluster spanning distributed data centers. We extend the overall architecture of the Hadoop ecosystem with blockchain so that multiple tenants and authorized third parties can securely access data to perform various analytics while still maintaining privacy, scalability, and reliability.

Anna Nicanorova is Director of Annalect Labs, a space for experimentation and rapid prototyping within Annalect. During her time at Annalect she has worked on numerous data-marketing solutions: attribution, optimizers, quantification of content, and image recognition technology. In 2015 Anna was part of the Annalect team that won the I-Com Data Science Hackathon.

Anna is cofounder of the Books+Whiskey meetup and a volunteer coding teacher with ScriptEd. She holds an MBA from the University of Pennsylvania’s Wharton School and a BA from Hogeschool van Utrecht.

Presentations

Data Visualization in Mixed Reality with Python Session

Data visualization is supposed to be our map to information. However, contemporary charting techniques have a few shortcomings: they reduce context, make numbers hard to grasp, and dehumanize perception. Augmented reality can potentially solve all of these issues by presenting an intuitive and interactive environment for data exploration.

Tawny has over 15 years’ experience supporting the homecare industry. As Chief Information Officer at SelectData, she is responsible for new product development, clinical tools, and all technology-related needs. She also leads SelectData’s innovation of data-driven business models. Tawny is currently pursuing a Master of Science in Health Care Informatics degree at the University of San Diego.

Presentations

Spark NLP in Action: How SelectData Uses AI to Better Understand Home Health Patients Session

This case study describes a question answering system for accurately extracting facts from free-text patient records. The solution is based on Spark NLP, an open source extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding. We'll share best practices for training the domain-specific deep learning NLP models such problems usually require.

Aileen Nielsen is a software engineer at One Drop, a company working on diabetes-management products. Aileen has worked in corporate law, physics research laboratories, and, most recently, NYC startups oriented toward improving daily life for underserved populations—particularly groups who have yet to fully enjoy the benefits of mobile technology. Her interests range from defensive software engineering to UX designs for reducing cognitive load to the interplay between law and technology. She currently serves as a member of the New York City Bar Association’s Science and Law committee, where she chairs a subcommittee devoted to exploring and advocating for scientifically driven regulation (and deregulation) of new and existing technologies. Aileen holds degrees in anthropology, law, and physics from Princeton, Yale, and Columbia respectively.

Presentations

How to be fair: a tutorial for beginners Tutorial

There is mounting evidence that the widespread deployment of machine learning and artificial intelligence in business and government applications is likely reproducing or even amplifying existing prejudices and social inequalities. This tutorial is designed to give knowledge and tools to data scientists so they can identify and avoid bias and other unfairness in their analyses.

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Owen O’Malley is a software architect working on Hadoop at Hortonworks, a startup focused on Hadoop development. Prior to cofounding Hortonworks, Owen and the rest of the Hortonworks team worked at Yahoo developing Hadoop. He has been contributing patches to Hadoop since before it was separated from Nutch and was the original chair of the Hadoop PMC. Before working on Hadoop, he worked on Yahoo Search’s WebMap project, which builds a graph of the known Web and applies many heuristics to the entire graph that control search. Prior to Yahoo, Owen wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He holds a PhD in software engineering from the University of California, Irvine.

Presentations

Introducing Iceberg: Tables Designed for Object Stores Session

Iceberg is a new open source project that defines a table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.

Brian O’Neill is a product designer and founder of the consulting firm Designing for Analytics, which helps companies design indispensable data products and analytics solutions that customers love. His clients and past employers include Dell EMC, NetApp, Tripadvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JPMorgan Chase, the Future of Music Coalition, and E*Trade, among others, and he has worked on award-winning IT/storage industry software for Akorri and Infinio. Brian has also brought over 20 years of design experience to various podcasts, meetups, and conferences such as the O’Reilly Strata Conference in New York City and London, England. He has also authored the Designing for Analytics Self-Assessment Guide for Non-Designers as well as numerous articles on design strategy, user experience, and business related to analytics. Brian is also an expert advisor on the topics of design and user experience for the International Institute for Analytics.

When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage performing as a professional percussionist and drummer. He leads the acclaimed dual-ensemble, Mr. Ho’s Orchestrotica that is “anything but straightforward” (Washington Post) and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival.

If you’re at a conference, just look for the only guy with a stylish orange leather messenger bag.

Presentations

UX Strategies for Underperforming Analytics Services and Data Products Session

Gartner says 85%+ of big data projects will fail, despite the fact that your company may have invested millions in engineering implementation. Why are customers and employees not engaging with these products and services? CDOs, CIOs, product managers, and analytics leaders with a "people first, technology second" mission (a design strategy) will realize the best UX and business outcomes possible. #design

Troels Oerting is a globally recognized cybersecurity expert who holds a number of board posts as a non-executive director at key companies, as well as high-profile advisory roles. He has worked on the cyber/security ‘first line’ for the last 38 years and has held a number of significant posts both nationally and internationally.
Troels joined Barclays as Group Chief Information Security Officer (CISO) in early 2015, reporting to the Group Chief Operations and Technology Officer. He was appointed Group Chief Security Officer in 2016, with end-to-end responsibility for all security in Barclays Group and for more than 3,000 security experts worldwide protecting the bank’s 50 million customers and 140,000 employees.
Before joining Barclays, Troels was Director of the European Cybercrime Centre (EC3), an EU-wide centre located in Europol’s headquarters, tasked with assisting law enforcement agencies in protecting the 500 million citizens of the 28 EU member states from cybercrime and loss of privacy.
As an expert in cybersecurity, Troels has constantly looked for new legislative, technical, and cooperative opportunities to efficiently protect the privacy and security of internet users. He has pioneered new methodologies to prevent crime in cyberspace and protect innocent users from losing their digital identity, assets, or privacy online. As Director of EC3, he also initiated the establishment of the international Joint Cybercrime Action Task Force (J-CAT), which includes leading global law enforcement agencies, prosecutors, and Interpol’s Global Centre of Innovation; the J-CAT has since been recognized as the leading international response to the increasing threat from organized cybercriminal networks.
He has been a cyber advisor for the EU Commission and Parliament, a permanent delegate to many governance organizations (e.g., ICANN, the ITU, and the Council of Europe), and an advisor on cyber-related questions to several governments and organizations. He also established a vast global outreach programme including law enforcement, NGOs, key tech companies, and industry, which together with academic research institutes formed a multifaceted global coalition against cybercriminal syndicates and networks, with the aim of enhancing online security without harming privacy and inventing new ways of protecting internet users. Before joining Europol as Director of EC3, Troels was Assistant Director of Europol’s Organised Crime department and of its Counter Terrorist department, and he also held positions as Director of Operations in the Danish Security Intelligence Service and Director of the Danish Serious Organised Crime Agency (SOCA).
Troels is also an external lecturer in cybercrime at a number of universities and business schools and has been awarded several times by global law enforcement agencies for his international leadership in fighting cyber- and organised crime. He is the author of a political thriller published in Danish, Operation Gamma.

Presentations

Next-Generation Cybersecurity via Data Fusion, AI, and Big Data: Pragmatic Lessons from the Front Lines in Financial Services Session

This presentation will share the main outcomes and learnings from building and deploying global data fusion, incident analysis and visualization, and effective cybersecurity defenses based on big data and AI at a major EU bank, in collaboration with several financial services institutions. The focus is on the learnings and breakthroughs gleaned from making these systems work.

Diego Oppenheimer, founder and CEO of Algorithmia, is an entrepreneur and product developer with an extensive background in all things data. Prior to founding Algorithmia, he designed, managed, and shipped some of Microsoft’s most used data analysis products, including Excel, Power Pivot, SQL Server, and Power BI.
Diego holds a bachelor’s degree in information systems and a master’s degree in business intelligence and data analytics from Carnegie Mellon University.

Presentations

Deploying ML Models in the Enterprise Session

After big investments in collecting & cleaning data, and building Machine Learning models, enterprises discover the big challenges in deploying models to production and managing a growing portfolio of ML models. This talk covers the strategic and technical hurdles each company must overcome and the best practices we've developed while deploying over 4,000 ML models for 70,000 engineers.

Francois Orsini is the chief technology officer for MZ’s Satori business unit. Previously, he served as vice president of platform engineering and chief architect, bringing his expertise in building server-side architecture and implementation for a next-gen social and server platform; was a database architect and evangelist at Sun Microsystems; and worked in OLTP database systems, middleware, and real-time infrastructure development at companies like Oracle, Sybase, and Cloudscape. Francois has extensive experience working with database and infrastructure development, honing his expertise in distributed data management systems, scalability, security, resource management, HA cluster solutions, soft real-time and connectivity services. He also collaborated with Visa International and Visa USA to implement the first Visa Cash Virtual ATM for the internet and founded a VC-backed startup called Unikala in 1999. Francois holds a bachelor’s degree in civil engineering and computer sciences from the Paris Institute of Technology.

Presentations

Correlation analysis on live data streams Session

Anomaly detection is a necessary but insufficient step: anomaly detection over a set of live data streams may result in anomaly fatigue, limiting effective decision making. We'll walk the audience through how marrying correlation analysis with anomaly detection can help, and share techniques to guide effective decision making.
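
The pairing the abstract describes, correlating two live streams so that related anomalies can be grouped instead of alerting separately, can be sketched with a Pearson correlation over fixed-size sliding windows. This is an illustrative sketch, not code from the talk; the class name, window size, and toy streams are assumptions.

```python
# Minimal sketch of correlating two live data streams over sliding windows.
from collections import deque
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

class StreamCorrelator:
    """Keeps the last `size` samples of two streams and correlates them."""
    def __init__(self, size=5):
        self.a = deque(maxlen=size)
        self.b = deque(maxlen=size)

    def push(self, x, y):
        self.a.append(x)
        self.b.append(y)
        if len(self.a) == self.a.maxlen:
            return pearson(list(self.a), list(self.b))
        return None  # window not yet full

corr = StreamCorrelator(size=4)
r = None
for x in [1, 2, 3, 4]:
    r = corr.push(x, 2 * x + 1)   # two perfectly linearly related streams
print(round(r, 3))  # -> 1.0
```

When two streams that alarm at the same time also show a high rolling correlation, their anomalies can plausibly be merged into one incident, which is one way to reduce the anomaly fatigue mentioned above.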

Occhio Orsini has over 25 years’ experience with data and analytics technology platforms. He started his career in application development and then spent time developing database engine technology and internet search technology for Ascential Software. After IBM acquired Ascential, Occhio played a central role in the creation of the IBM Information Server suite. Wanting to improve the adoption of these technologies, he took a position in Aetna’s enterprise architecture group, where he worked on the strategy and adoption of data analytics and data governance platforms. As big data and data science became the new direction for analytics, Occhio led the solution engineering and architecture efforts to build Aetna’s Data Fabric, which supports the company’s advanced analytics initiatives across the organization.

Presentations

Aetna's Advanced Analytics Platform (Data Fabric) Session

Aetna's Data Fabric platform is based on the Hadoop technology stack but integrates many different technologies to create a robust data lake and advanced analytics platform that meets the needs of Aetna's data scientists and analytics practitioners.

Steve Otto is the Associate Director of the Enterprise Architecture team at Navistar and helps shape the technology strategy and architecture to drive business goals. He was formerly the manager of the Information Management team at Navistar.

Mr. Otto started his career as a developer in the management consulting practice at Ernst & Young and has held a variety of roles in his IT career. He has worked in a number of different capacities, with direct responsibility for a wide range of activities, including the planning, design, build, operation, and support functions for IT projects in the consumer products, retail, aerospace and defense, healthcare, manufacturing, and higher education markets.

Presentations

Driving Predictive Analytics for IoT & Connected Vehicles Data Case Studies

Navistar built an IoT-enabled remote diagnostics platform, called OnCommand® Connection, to bring together data from more than 375,000 vehicles in real time to drive predictive analytics. The service is offered to fleet owners, who can now monitor the health and performance of their trucks from smartphones or tablets. Join Steve Otto from Navistar to learn more about the company's IoT and data journey.

Jerry Overton is a data scientist and distinguished technologist in DXC’s Analytics group, where he is the principal data scientist for industrial machine learning, a strategic alliance between DXC and Microsoft comprising enterprise-scale applications across six different industries: banking and capital markets, energy and technology, insurance, manufacturing, healthcare, and retail. Jerry is the author of Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist (O’Reilly) and teaches the Safari training course Mastering Data Science at Enterprise Scale. In his blog, Doing Data Science, Jerry shares his experiences leading open research and transforming organizations using data science.

Presentations

Minimum-Viable Machine Learning: The Applied Data Science Bootcamp (Sponsored by DXC Technology) 1-Day Training

Acquiring machine-learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp, we teach students how to apply advanced analytics in ways that reshape the enterprise and improve outcomes. This training is equal parts hackathon, presentation, and group participation.

Mani Parkhe is an ML/AI platform engineer at Databricks, working on various customer-facing and open source platform initiatives to enable data discovery, training, experimentation, and deployment of ML models on the cloud. He is a lifelong student and coding geek with a passion for elegance in design. After spending 15 years building software for semiconductor chip CAD, Mani transitioned to building big data infrastructure, distributed systems, web services, and machine learning. Prior to Databricks, he worked on various data-intensive batch and stream processing problems at LinkedIn and Uber. Mani holds a master’s degree in CS from the University of Florida. He lives in Almaden Valley with his wife and three amazing kids.

Presentations

MLflow: An open platform to simplify the machine learning lifecycle Session

Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and roll back updated models is much harder.

Joshua Patterson is the director of applied solutions engineering at NVIDIA. Previously, Josh worked with leading experts across the public and private sectors and academia to build a next-generation cyberdefense platform. He was also a White House Presidential Innovation Fellow. His current passions are graph analytics, machine learning, and GPU data acceleration. Josh also loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina’s Moore School of Business.

Presentations

Accelerating Financial Data Science Workflows With GPU Session

GPUs have allowed financial firms to run complex simulations, train myriads of models, and data mine at unparalleled speeds. Today, the bottleneck has moved completely to ETL. With the GPU Open Analytics Initiative (GoAi), we’re accelerating ETL and keeping the entire workflow on GPUs. We’ll discuss real-world examples, benchmarks, and how we’re accelerating our largest FS customers.

Jennifer Prendki is currently the VP of Machine Learning at Figure Eight, the essential human-in-the-loop AI platform for data science and machine learning teams. She has spent most of her career creating a data-driven culture wherever she went, succeeding in sometimes highly skeptical environments. She is particularly skilled at building and scaling high-performance Machine Learning teams, and is known for enjoying a good challenge. Trained as a particle physicist (she holds a PhD in Particle Physics from Sorbonne University), she likes to use her analytical mind not only when building complex models, but also as part of her leadership philosophy. She is pragmatic yet detail-oriented. Jennifer also takes great pleasure in addressing both technical and non-technical audiences at conferences and seminars, and is passionate about attracting more women to careers in STEM.

Presentations

Executive Briefing: Agile for Data Science teams Session

The Agile methodology has been widely successful for software engineering teams but can seem inappropriate for data science teams, because data science is part engineering, part research. In this talk, I will show how, with a minimal amount of tweaking, data science managers can adapt Agile techniques and establish best practices to make their teams more efficient.

Amanda C. Pustilnik is a professor of law at the University of Maryland School of Law and permanent faculty at the Center for Law, Brain & Behavior at Massachusetts General Hospital. Her work focuses on the intersections of law, science, and culture, with a particular emphasis on neuroscience and neurotechnologies. In 2015, she served as Harvard Law School’s first senior fellow on law and applied neuroscience, where she focused on the neuroimaging of pain in itself and as a model for imaging subjective states relevant to law. Her collaborations with scientists on pain-related brain imaging, and her expertise in criminal law, led to her recent work on the opioids crisis on behalf of the Aspen Institute. She also writes and teaches in the areas of scientific and forensic evidence, on which she helps train federal and state judges.

Prior to entering the academy, Prof. Pustilnik practiced litigation at Covington & Burling and at Sullivan & Cromwell, clerked on the Second Circuit Court of Appeals, and worked as a management consultant at McKinsey & Co. She is a graduate of Harvard College and Yale Law School, and completed a fellowship at the University of Cambridge, where she studied history and philosophy of science. Her work has been published in numerous law reviews and peer-reviewed scientific journals, including Nature.

Presentations

Keynote with Amanda Pustilnik Keynote

Amanda Pustilnik, professor of law at the University of Maryland School of Law, and permanent faculty at the Center for Law, Brain & Behavior at Massachusetts General Hospital.

Gregory M. Quist, Ph.D.
President & CEO
Greg is the cofounder, president, and CEO of SmartCover Systems, leading the strategic direction and operations of the company. A longtime member of the water community, Greg was elected to the Rincon del Diablo MWD board of directors in 1990, where he has served for the past 27 years in various roles, including president and treasurer. Rincon’s board appointed Greg to the San Diego County Water Authority board in 1996, where he served for 12 years and led a coalition of seven agencies to achieve more than $1M/year in water delivery savings. He is currently the chairman of the Urban Water Institute. With a background in metamaterials, numerical analysis, signal processing, pattern recognition, wireless communications, and system integration, Greg has worked as a technologist, manager, and executive at Alcoa, McDonnell Douglas, and SAIC and has founded and successfully spun off several high-technology startups, primarily in real-time detection and water technology. He holds 14 patents and has several pending. Greg received his undergraduate degree in astrophysics with a minor concentration in economics from Yale College, where he played football and baseball, and his PhD in physics from the University of California, Santa Barbara. He has held top-level government clearances and currently resides in Escondido, CA. In his rare free time he enjoys fly fishing, hiking, golf, basketball, and tennis.

Presentations

Sewers can Talk – Understanding the Language of Sewers Data Case Studies

The first step in solving this crisis is knowing the extent and severity of the problem. Water levels in sewers have a signature analogous to a human EKG. This signature can be analyzed in real time using pattern recognition techniques, revealing distressed pipelines and allowing users of this technology to take appropriate steps for maintenance and repair. Sewers can talk!
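
The kind of real-time signature analysis described above can be illustrated with a simple rolling z-score over a water-level series: a reading far from the recent baseline is flagged for inspection. This is a generic sketch in the spirit of the EKG analogy, not SmartCover's actual algorithm; the window size and threshold are illustrative.

```python
# Minimal sketch of flagging an anomalous water-level reading with a
# rolling z-score over a fixed window of recent samples.
from collections import deque
from math import sqrt

def rolling_flags(levels, window=6, threshold=3.0):
    """Yield (value, is_anomaly) pairs; a reading is anomalous when it sits
    more than `threshold` standard deviations from the recent mean."""
    recent = deque(maxlen=window)
    for v in levels:
        if len(recent) == window:
            mean = sum(recent) / window
            var = sum((x - mean) ** 2 for x in recent) / window
            sd = sqrt(var) or 1e-9          # guard against zero variance
            yield v, abs(v - mean) / sd > threshold
        else:
            yield v, False                  # not enough history yet
        recent.append(v)

# Steady flow, then a sudden rise such as an emerging blockage might cause.
series = [10, 11, 10, 12, 11, 10, 11, 40]
flags = [a for _, a in rolling_flags(series)]
print(flags)  # only the final spike is flagged
```

A production system would look for richer temporal patterns than a single spike, but the principle is the same: learn the pipe's normal rhythm, then alert on departures from it.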

Syed is a Principal Systems Engineer at Cloudera, where he has specialized in big data and Hadoop technologies since 2009. He is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution, with a focus on both platform security and cybersecurity. He has worked across multiple sectors, including government, telecoms, media, utilities, financial services, and transport.

Presentations

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over his 20-year career, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently, product management, to provide a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Optimizing Apache Impala for a cloud-based data warehouse Session

Cloud object stores are becoming the bedrock of a cloud data warehouse for modern data-driven enterprises. Given today's data sizes, it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. In this talk, we'll discuss the optimal end-to-end workflows and technical considerations of using Apache Impala over object stores for your cloud data warehouse.

Mala Ramakrishnan heads product initiatives for Cloudera Altus, Cloudera's big data platform as a service. She has 17+ years of experience in product management, marketing, and software development at organizations of varied sizes that deliver middleware, software security, network optimization, and mobile computing. She holds a master’s degree in computer science from Stanford University.

Presentations

Comparative Analysis of the Fundamentals of AWS and Azure Session

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies face the difficult and daunting decision of which cloud to go with, a decision that is not just financial and in many cases rests on the underlying infrastructure. In this talk, we use our experience building production services on AWS and Azure to compare their strengths and weaknesses.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Designing Modern Streaming Data Applications Tutorial

In this tutorial, we will walk the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, stream computing frameworks, and storage frameworks for real-time data. We will also walk through case studies from IoT, gaming, and healthcare and share our experiences operating these systems at internet scale.

High Performance Messaging with Apache Pulsar Session

Apache Pulsar, a messaging system, is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it is very important to ensure that the system can make use of all the available resources. This talk will provide insight into the design decisions and implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.

Suyash Ramineni is a presales engineer at Cloudera, where he focuses on helping customers solve business problems at scale. He is part of Cloudera's field IoT and manufacturing specialization team and is usually occupied guiding platform teams to manage and deploy data analytics and data science workloads on large clusters. Previously, Suyash worked as a software engineer at Intel and a few startups, focused on data-driven approaches to solving problems. He has eight years of experience working with customers and partners to solve business problems.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Radhika Rangarajan is an engineering director for big data technologies within Intel’s Software and Services Group, where she manages several open source projects and partner engagements, specifically on Apache Spark and machine learning. Radhika is one of the cofounders and the director of the West Coast chapter of Women in Big Data, a grassroots community focused on strengthening the diversity in big data and analytics.

Presentations

Job recommendations leveraging Deep Learning on Apache Spark with BigDL Session

Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? In this session, we will demonstrate how to leverage BigDL, a distributed deep learning framework for Apache Spark, to predict a candidate’s probability of applying to specific jobs based on their resume.

Delip Rao is the founder of Joostware AI Research Corp., which specializes in consulting and building IP in natural language processing and deep learning. Delip is a well-cited researcher in natural language processing and machine learning and has worked at Google Research, Twitter, and Amazon (Echo) on various NLP problems. He is interested in building cost-effective, state-of-the-art AI solutions that scale well. Delip has an upcoming book on NLP and deep learning from O’Reilly.

Presentations

Machine Learning with PyTorch 1-Day Training

Explore machine learning and deep learning with PyTorch and learn how to build effective models for real-world data.

Jun Rao is the cofounder of Confluent, a company that provides a streaming data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM’s Almaden Research Center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer on Apache Cassandra.

Presentations

A deep dive into Kafka controller Session

The controller is the brain of Apache Kafka and is responsible for maintaining the consistency of the replicas. We will first describe the main data flow in the controller, then describe some of the recent improvements in the controller that handle certain edge cases correctly and allow for more partitions in a Kafka cluster.

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed, robust cloud applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. Radhika enjoys spending time with her family, walking her dog, doing Warrior X-Fit, and playing an occasional hand at Smash Bros.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Anthony Reid is the senior manager of analytics at Komatsu and an automation technology specialist with broad-ranging experience in machine perception and control, distributed systems, and data analytics. Anthony is interested in integrating automation and perception technologies to enhance interconnectivity and intelligence in machines and currently works to drive analytics and intelligence for IoT and connected mining equipment at Komatsu.

Prior to Komatsu, Anthony was an engineering lead at the University of Queensland, developing technologies to improve the safety and efficiency of mining operations. He was also responsible for the commissioning and testing of the P&H payload system on two shovels in Australia and the US, and for various programming tasks to complete this work.

Anthony holds a PhD in mechanical engineering from the University of Queensland and currently resides in Milwaukee, Wisconsin.

Presentations

How Komatsu is Improving Mining Efficiencies using IoT and Machine Learning Session

Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the world's largest mining equipment to ultimately improve mine performance and efficiency. Join Shawn Terry and Anthony Reid to learn more about their data journey and how they are using advanced analytics and predictive modeling to drive insights from terabytes of IoT data from connected mining equipment.

The system featured in this presentation is the invention of LaVonne Reimer, a lawyer-turned-entrepreneur with decades of experience building digital platforms for markets with identity and data privacy sensitivities. LaVonne was Founder and CEO of Cenquest, a venture-backed startup that provided the technology backbone for graduate schools such as NYU Stern School, London School of Economics, and UT Austin to offer branded degree programs online. More recently, she led a program to foster entrepreneurship in open source together with Open Source Development Labs (Linux), IBM, and Intel. The Open Authorization Protocol, initiated by members of this community, inspired her to begin work on governance and trust assurance for free-flowing data.

Presentations

Balancing stakeholder interests in personal data governance technology Session

GDPR asks us to rethink personal data systems, viewing UI/UX, consent management, and value-added data services through the eyes of the subjects of the data. The opportunity in the $150B credit and risk industry is to deploy data governance technologies that balance the interests of individuals in controlling their own data with requirements for trusted data.

Randy Ridgley is a solutions architect on the Amazon Web Services public sector team. Previously, Randy was the principal application architect for Walt Disney World's MagicBand platform in Orlando, improving guest experience and cast coordination by building big data solutions based on AWS services. He has over 15 years of experience building real-time streaming and big data analytics applications in the media and entertainment, casino gaming, and publishing industries.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Steve Ross is the director of product management at Cloudera, where he focuses on security across the big data ecosystem, balancing the interests of citizens, data scientists, and IT teams working to get the most out of their data while preserving privacy and complying with the demands of information security and regulations. Previously, at RSA Security and Voltage Security, Steve managed product portfolios now in use by the largest global companies and hundreds of millions of users.

Presentations

Executive Briefing: Getting Your Data Ready for Heavy EU Privacy Regulations (GDPR) Session

The General Data Protection Regulation (GDPR) goes into effect in May 2018 for firms doing any business in the EU. However, many companies aren't prepared for the strict regulation or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify GDPR compliance, as well as compliance with future regulations.

Nikki Rouda is the cloud and core platform director at Cloudera. He has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their IT challenges. His career spans big data, analytics, machine learning, AI, storage, networking, security, and the IoT. He holds an MBA from Cambridge and an ScB in geophysics and math from Brown.

Presentations

DIY vs. designer approaches to deploying data center infrastructure for machine learning and analytics Session

Learn how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble.

Nuria Ruiz (@pantojacoder)
Nuria began working on the Wikimedia Foundation's analytics team in December 2013. Before becoming part of the awesome project that Wikipedia is, she spent time working on JavaScript, performance, mobile apps, and web frameworks in the retail and social spaces. Most of her experience deploying large applications comes from the seven years she worked at Amazon.com. She is a physicist by training and started writing software in a physical oceanography lab in Seattle, a long time ago, when big data was just called “science.”

Presentations

Data and Privacy at Scale at Wikipedia Session

The Wikipedia community feels strongly that you shouldn’t have to provide personal information to participate in the free knowledge movement. In this talk, we will go into the challenges that this strong privacy stance poses for the Wikimedia Foundation, including how it affects data collection, and some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.

Patty Ryan is an applied data scientist at Microsoft, where she codes with partners and customers to tackle tough problems using machine learning approaches with sensor, text, and vision data. She is a graduate of the University of Michigan. She tweets at @singingdata.

Presentations

When Tiramisu Meets Online Fashion Retail Session

Large online fashion retailers face the problem of efficiently maintaining a catalogue of millions of items. Due to human error, it is not unusual for some items to have duplicate entries, and trawling through such a large catalogue manually is nearly impossible. How would you prevent such errors? Find out how we applied deep learning as part of the solution.

Anand is a co-founder of Gramener, a data science company. He leads a team of data enthusiasts who tell visual stories of insights from analysis. These are built on the Gramener Visualisation Server.

He studied at IIT Madras, IIM Bangalore and LBS, and worked at IBM, Infosys, Lehman Brothers and BCG.

Profile: https://www.linkedin.com/in/sanand0/

Presentations

Mapping India Data Case Studies

Answering simple questions about India's geography can be a nightmare. What is the boundary of a postal code? Or a census block? Or even a constituency? The official answer resides in a set of manually drawn PDFs. But an active group of volunteers is crafting open maps, with coverage and quality that may enable the largest census exercise in the world in 2020.

Cloudera Systems Engineer

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.

Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by data scientists. He helps build services that are part of Stitch Fix’s Data Warehouse ecosystem.

Presentations

Tracking Data Lineage at Stitch Fix Session

This talk explains how we at Stitch Fix built a service to better understand the movement and evolution of data within our data warehouse, from initial ingestion from outside sources through all of our ETLs. We talk about why we built the service, how we built it, and the use cases that benefit from it.

Shioulin Sam is a research engineer at Cloudera Fast Forward Labs. In her previous life, she was an angel investor focusing on women-led startups. She also worked in the investment management industry, designing quantitative trading strategies. She holds a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Presentations

Semantic Recommendations Session

Recent advances in deep learning allow us to use the semantic content of items in recommendation systems, addressing a weakness of traditional methods. In this talk we explore limitations of classical approaches and look at how using the content of items can help solve common recommendation pitfalls such as the cold start problem, and open up new product possibilities.

Osman Sarood received his PhD in high-performance computing from the computer science department at the University of Illinois Urbana-Champaign in December 2013, where he focused on load balancing and fault tolerance. Dr. Sarood has published more than 20 research papers in highly rated journals, conferences, and workshops; he has presented his research at several academic conferences and has over 400 citations, along with an i10-index and h-index of 12. He worked at Yelp from 2014 to 2016 as a software engineer, where he prototyped, architected, and implemented several key production systems that have been presented at high-profile conferences: he presented Seagull at the AWS re:Invent conference in 2015, and he architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser, which was presented at AWS re:Invent 2016. Dr. Sarood joined Mist in 2016 and leads the infrastructure team, helping Mist scale the Mist Cloud in a cost-effective and reliable manner.

Presentations

How to Cost Effectively and Reliably Build Infrastructure for Machine Learning Session

Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, we saw 10x infrastructure growth. Learn how we reliably run 75% of our production infrastructure on AWS EC2 spot instances, which has kept our annual AWS cost to $1 million instead of $3 million (a 66% reduction).

Toru Sasaki is a system infrastructure engineer who leads the OSS professional services team at NTT Data Corporation. He is interested in open source distributed computing systems such as Apache Hadoop, Apache Spark, and Apache Kafka and has designed and developed many clusters utilizing these products to solve his customers’ problems. He is a coauthor of a well-known Apache Spark book written in Japanese.

Presentations

Best Practices to Develop an Enterprise Datahub to Collect and Analyze 1TB/day Data from a Lot of Services with Apache Kafka and Google Cloud Platform in Production Session

Recruit Group and NTT DATA Corporation developed a platform based on a "datahub" utilizing Apache Kafka. This platform handles around 1TB/day of application logs generated by many services in Recruit Group. This session explains some of the best practices and know-how learned during the project, such as schema evolution and network architecture.

Eric has worked in the data space for the past 10 years, starting with call center performance analytics at Merced Systems. He currently works at Uber with large volumes of geospatial data, helping people move in countries around the world.

Presentations

Marmaray – A generic, scalable, and pluggable Hadoop data ingestion & dispersal framework Session

Marmaray is a generic Hadoop ingestion and dispersal framework recently released to production at Uber. We will introduce the main features of Marmaray and the business needs it meets, share how Marmaray can help a team's data needs by ensuring data can be reliably ingested into Hive or dispersed into online data stores, and give a deep dive into the architecture to show you how it all works.

Friederike Schüür is a research engineer at Cloudera Fast Forward Labs, where she imagines what applied machine learning in industry will look like in two years' time, a horizon that fosters ambition and yet provides grounding. She dives into new machine learning capabilities and builds fully functioning prototypes that showcase state-of-the-art technology applied to real use cases. She advises clients on how to make use of new machine learning capabilities, from strategy advising to hands-on collaboration with in-house technical teams. She earned a PhD in cognitive neuroscience from University College London and is a long-time data science for social good volunteer with DataKind.

Presentations

From Strategy to Implementation — Putting Data to Work at USA for UNHCR Session

The Hive and Cloudera Fast Forward Labs share how they helped USA for UNHCR (the UN Refugee Agency) use data science and machine learning (DS/ML) to address the refugee crisis. From identifying use cases and success metrics to showcasing the value of DS/ML, we cover the development and implementation of a DS/ML strategy, hoping to inspire other organizations looking to derive value from data.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. He is passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals, and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

Using the blockchain in the enterprise Session

Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures.

I am a Solutions Architect supporting AWS partners in the Big Data space.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the Internet of Things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Executive Briefing: Managing successful data projects - technology selection and team building Session

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. In this session we'll provide guidance and best practices to help technical leaders deliver successful projects from planning to implementation.

Rama Sekhar is Principal at Norwest Venture Partners, where he focuses on early- to late-stage venture investments in enterprise and infrastructure, including cloud, big data, DevOps, cybersecurity, and networking. Rama’s current investments include Agari, Bitglass, and Qubole. Rama was previously an investor in Morta Security (acquired by Palo Alto Networks), Pertino Networks (acquired by Cradlepoint), and Exablox (acquired by StorageCraft). Before joining Norwest, Rama was with Comcast Ventures; a product manager at Cisco Systems, where he defined product strategy for the GSR 12000 Series and CRS-1 routers—$1B+ networking products in the carrier and data center markets; and a sales engineer at Cisco Systems, where he sold networking and security products to AT&T. Rama holds an MBA from the Wharton School of the University of Pennsylvania with a double major in finance and entrepreneurial management and a BS in electrical and computer engineering, with high honors, from Rutgers University.

Presentations

VC trends in machine learning and data science Session

In this panel, venture capital investors will discuss how startups can accelerate enterprise adoption of machine learning and what new tech trends will give rise to the next transformation in the Big Data landscape.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

The Future of ETL Isn’t What It Used To Be Session

Gwen Shapira will share design and architecture patterns that are used to modernize data engineering. We will see how Apache Kafka, microservices and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable and built to evolve.

Dave Shuman is the industry lead for IoT and manufacturing at Cloudera. Dave has an extensive background in big data analytics, business intelligence applications, database architecture, logical and physical database design, and data warehousing. Previously, Dave held a number of roles at Vision Chain, a leading demand signal repository provider enabling retailer and manufacturer collaboration, including chief operations officer; vice president of field operations, responsible for customer success and user adoption; vice president of product, responsible for product strategy and messaging; and director of services. He also served top CG companies such as Kraft Foods, PepsiCo, and General Mills, where he was responsible for implementations. Earlier, he was vice president of operations for enews, an e-commerce company acquired by Barnes and Noble, and its executive vice president of management information systems, where he managed software development, operations, and retail analytics, developed e-commerce applications and business processes used by Barnesandnoble.com, Yahoo, and Excite, and pioneered an innovative process for affiliate commerce. He holds an MBA with a concentration in information systems from Temple University and a BA from Earlham College.

Presentations

Using Machine Learning to Drive Intelligence at the Edge Session

The focus of IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing those learnings back out to the edge. Join Cloudera and Red Hat as they showcase how they executed this architecture at one of the world’s leading manufacturers in Europe, including a demo highlighting the architecture.

Kamil Sindi is a Principal Engineer working on productionizing machine learning algorithms and scaling distributed systems. He received his Bachelor’s Degree in Mathematics with Computer Science from Massachusetts Institute of Technology.

Presentations

Building Turn-key Recommendations for 5% of Internet Video Session

Building a video recommendation model that serves millions of monthly visitors is a challenge in itself. At JW Player, we face the challenge of providing on-demand recommendations as a service to thousands of media publishers. We focus on how to systematically improve model performance while navigating the many engineering challenges and unique needs of the diverse publishers we serve.

Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Jason ‘Jay’ Smith is a cloud customer engineer at Google, where he spends his days helping enterprises find ways to expand their workload capabilities on Google Cloud. Big data is one of his passions, as organizations find ways to collect, store, and analyze information. He is currently on the Kubeflow go-to-market team, helping people containerize machine learning to improve portability and scalability.

Presentations

From Training to Serving: Deploying Tensorflow Models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join this tutorial to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Guoqiong Song (PhD) is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

Job recommendations leveraging Deep Learning on Apache Spark with BigDL Session

Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? In this session, we will demonstrate how to leverage BigDL, a distributed deep learning framework for Apache Spark, to predict a candidate’s probability of applying to specific jobs based on their resume.

Tim Spann has over a decade of experience with IoT, big data, distributed computing, streaming technologies, and Java programming. He holds a BS and an MS in computer science. He was a senior solutions architect at AirisData, working with Spark and machine learning, and before that a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, IoT, deep learning, streaming, NiFi, blockchain, and Spark. He is currently a solutions engineer II at Hortonworks, working with Apache Spark, big data, IoT, machine learning, and deep learning. He is speaking at http://iotfusion.net/ and DataWorks Summit Berlin this year and spoke at DataWorks Summit Sydney and Oracle Code NYC last year.

https://dzone.com/refcardz/introduction-to-tensorflow
http://www.meetup.com/futureofdata-princeton/
https://community.hortonworks.com/users/9304/tspann.html
https://dzone.com/users/297029/bunkertor.html
https://github.com/tspannhw

Presentations

IoT Edge Processing with Apache NiFi and MiniFi and Multiple Deep Learning Libraries Session

A hands-on deep dive into using Apache MiniFi with Apache MXNet and other deep learning libraries on the edge device.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural Language Understanding at Scale with Spark NLP Tutorial

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable, open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Spark NLP in Action: How SelectData Uses AI to Better Understand Home Health Patients Session

This case study describes a question answering system for accurately extracting facts from free-text patient records. The solution is based on Spark NLP, an open source extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding. We'll share best practices for training the domain-specific deep learning NLP models that such problems usually require.

Wangda Tan is a Project Management Committee (PMC) member of Apache Hadoop and a staff software engineer at Hortonworks. His major working area is Hadoop YARN GPU isolation and the resource scheduler, and he has contributed features such as node labeling, resource preemption, and container resizing. Before joining Hortonworks, he worked at Pivotal on integrating OpenMPI/GraphLab with Hadoop YARN. Before that, he worked at Alibaba Cloud Computing, helping to create a large-scale machine learning, matrix, and statistics computation platform using MapReduce and MPI.

Presentations

Deep learning on YARN: Running distributed TensorFlow, MXNet, Caffe, and XGBoost on Hadoop clusters Session

Applications such as TensorFlow, MXNet, Caffe, and XGBoost can be leveraged to train deep learning and machine learning models, so we introduced new features in Apache Hadoop 3.x, such as GPU isolation and Docker support, to better support these workloads. In this talk we take a closer look at these improvements and show how to run these applications on YARN, with demos.

Elena Terenzi is a software development engineer at Microsoft, where she brings business intelligence solutions to Microsoft enterprise customers and advocates for business analytics and big data solutions for the manufacturing sector in Western Europe, helping big automotive customers implement IoT telemetry analytics solutions in their enterprises. She started her career with data as a database administrator and data analyst for an investment bank in Italy. Elena holds a master’s degree in AI and NLP from the University of Illinois at Chicago.

Presentations

When Tiramisu Meets Online Fashion Retail Session

Large online fashion retailers face the problem of efficiently maintaining a catalogue of millions of items. Due to human error, it is not unusual for some items to have duplicate entries. Trawling through such a large catalogue manually is nearly impossible. How would you prevent such errors? Find out how we applied deep learning as part of the solution.

Shawn Terry (Edmonton, AB, Canada) is the lead systems architect for Komatsu Mining’s analytics platform. With more than 20 years of experience as a software developer, consultant, and architect, Shawn has spent the last 10 years designing, developing, deploying, and evolving Komatsu Mining’s data analytics platform. In 2016, he helped lead an effort to transform fragile, fragmented legacy systems into a truly scalable distributed solution built on open source, centered on Cloudera CDH, and deployed in Microsoft Azure. The project was completed in under a year with a handful of developers and no increase in budget, and it enables Komatsu Mining to successfully partner with its customers to solve mining’s toughest challenges with smart, data-driven solutions.

Presentations

How Komatsu is Improving Mining Efficiencies using IoT and Machine Learning Session

Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the world's largest mining equipment to ultimately improve mine performance and efficiency. Join Shawn Terry and Anthony Reid to learn more about their data journey and how they are using advanced analytics and predictive modeling to drive insights from terabytes of IoT data from connected mining equipment.

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural Language Understanding at Scale with Spark NLP Tutorial

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable, open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

I succeed when assisting clients in developing solutions that have measurable business impact. With 25 years of field experience, I have developed solutions for multiple vertical industries, including banking/financial services, retail, life sciences, and others.

Presentations

If You Thought Politics Was Dirty, You Should See the Analytics Behind It Session

Forget about the fake news; data and analytics are what drive elections. While proposing analytical solutions to the RNC and DNC, I faced ethical dilemmas. Not only did I help causes I disagreed with, but I also armed politicians with “real-time” data to manipulate voters. Politics is a business, and today’s modern data infrastructure optimizes campaign funds more effectively than ever.

Yaroslav Tkachenko is a software engineer interested in distributed systems, microservices, functional programming, modern cloud infrastructure, and DevOps practices. Currently, Yaroslav is a senior data engineer at Activision, working on a large-scale data pipeline.

Prior to joining Activision, Yaroslav held various leadership roles in multiple startups, where he was responsible for designing, developing, delivering, and maintaining platform services and cloud infrastructure for mission-critical systems.

Presentations

Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned Session

What can be easier than building a data pipeline? You add a few Apache Kafka clusters, some way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, it does start to look like A LOT of things, doesn't it? Join this talk to learn about the best practices we've been using for all the above.

Steven Totman is the financial services industry lead for Cloudera’s Field Technology Office, where he helps companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Prior to Cloudera, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining with the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents for data-integration and governance/metadata-related designs.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

As VP of strategy, Jane works directly with clients to set the direction for the Unqork platform in both user experience and functionality.

She has been helping leaders in financial services assess and implement new business strategies since the start of her career. She has worked on internal strategy teams for C-suites at JPMorgan Chase, Marsh, and MetLife. Most recently, she advised a portfolio of startups for Techstars Connection in partnership with AB InBev.

She holds a BA in economics and policy studies from Syracuse University.

Presentations

The balancing act: Building business relevant data solutions for the front line Findata

Data’s role in financial services has been elevated. However, rollouts of data solutions often fail when an organization’s existing culture is misaligned with its capabilities. With Unqork, we’re increasing adoption by honoring existing capabilities. This discussion explores methods to finally implement data solutions through both qualitative and quantitative discoveries.

Michelle Ufford leads the Data Platform Architecture Core team at Netflix, which focuses on platform innovation and usability. Previously, she led the data management team at GoDaddy, where she built data engineering solutions for personalization and helped pioneer Hadoop data warehousing techniques. Michelle is a published author, patented developer, award-winning open source contributor, and Most Valuable Professional (MVP) for Microsoft Data Platform. You can find her on Twitter at @MichelleUfford.

Presentations

Data @ Netflix: See What’s Next Session

In this talk, Michelle Ufford will share some cool things Netflix is doing with data and the big bets we’re making on data infrastructure. Topics will include workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, and platform intelligence.

Sandeep Uttamchandani is a distinguished engineer at Intuit, focusing on platforms for storage, databases, analytics, and machine learning. Prior to Intuit, Sandeep was cofounder and CEO of a machine learning startup focused on finding security vulnerabilities in cloud native deployment stacks. Sandeep has nearly two decades of experience in storage and data platforms and has held various technical leadership roles at VMware and IBM. Over his career, Sandeep has contributed to multiple enterprise products, holds 35+ issued patents, has 20+ conference and journal publications, and regularly blogs on all things enterprise data. He has a PhD from the University of Illinois at Urbana-Champaign.

Presentations

Circuit-breakers to safeguard for garbage in, garbage out Session

Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Similar to the circuit-breaker design pattern used in service architectures, this talk describes a circuit-breaker pattern we developed for data pipelines, which lets us detect and correct problems and ensure reliable insights.

Preeti Vaidya is a data science professional working on solving real-world problems with novel computational techniques.

She holds an MS in computer science with a focus on machine learning from Columbia SEAS (’15), and her research interests include big data, machine learning, graph analytics, parallel and distributed computing systems, and image processing.

Her work has been published in IEEE Xplore and other technical journals, and she has been invited to present her work at technical conferences.

Presentations

Agility to Data Product Development: Plug and Play Data Architecture Session

Data products, distinct from data-driven products, are finding their own place in organizational data-driven decision making. Shifting the focus to “data” opens up new opportunities. This presentation, with case studies, dives deeper into a layered implementation architecture and provides intuitive learnings and solutions that allow for more agile, reusable data modules for a data product team.

Balaji works on the Hudi project at Uber and broadly oversees data engineering across the network performance monitoring domain. Previously, he was one of the lead engineers on LinkedIn’s Databus change capture system as well as Espresso (a NoSQL store). Balaji’s interests lie broadly in distributed data systems.

Presentations

Hudi: Unifying storage and serving for batch and near real-time analytics Session

Uber has a real need to provide faster, fresher data to data consumers and products, running hundreds of thousands of analytical queries every day. Uber engineers share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Abhi leads applied technology innovation at Novartis and is responsible for experimenting with, incubating, and rapidly scaling digital platforms and services across the enterprise.

Over the years, as a serial intrapreneur, he has established platforms and services in real-world evidence, robotic and cognitive automation, advanced analytics, data platforms, standards, and wearables in clinical trials.

Abhi is passionate about and driven by harnessing the intersection of science, technology, data, and people to improve patients’ lives. He has also been involved in deploying technology for public health projects in underdeveloped regions.

Passionate about cycling, comics, science fiction, and history, Abhi holds an MBA from the Indian School of Business and a master’s of science in pharmaceutical medicine, and is an engineer. He currently lives in Basel, Switzerland.

Presentations

Crossing the chasm: Case study of realizing a big data strategy Data Case Studies

A case study on how a transformational business opportunity was realized on the foundation of an integrated data, process, culture, organization, and technology strategy.

Tim began his 21-year career as an IT consultant with ICL (and then Fujitsu). He spent time working in Seattle for Microsoft, three years at the European Commission in Luxembourg, and three years at HP; he now works for BJSS as a hands-on cognitive architect.

Tim has always had a passion for systems integration and is always looking for clever and innovative ways to connect systems together.

After joining BJSS, Tim became head of mobile, where he showed his passion for the design, development, and delivery of quality mobile applications.

Since then, chatbots, artificial intelligence, and machine learning have sparked Tim’s interest, and he has focused his attention on this exciting area. As a cognitive architect, he is very excited to be designing complex, vendor-agnostic, multilingual, cloud-based chatbot solutions for a range of clients.

Tim has two grown-up daughters and lives in Newington Green in London. He enjoys choral singing and is currently renovating and extending his 1850s property.

Presentations

Using big data to unlock the delivery of personalized, multi-lingual real-time chat services for global financial service organizations Session

Financial services clients demand increased data-driven personalization, faster insight-based decisions, and multichannel, real-time access. BJSS discusses how organizations can deliver real-time, vendor-agnostic, personalized chat services, covering the issues around security, privacy, legal sign-off, and data compliance, as well as how the Internet of Things can be used as a delivery platform.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. We will explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler, Ph.D., is the VP of Fast Data Engineering at Lightbend. He leads the development of Lightbend Fast Data Platform, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly Media. He is a contributor to several open source projects, a frequent Strata speaker, and the co-organizer of several conferences around the world and several user groups in Chicago. Dean lurks on Twitter as @deanwampler.

Presentations

Executive Briefing: What You Need to Know About Fast Data Session

Streaming data systems, so-called "Fast Data", promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just "faster" versions of Big Data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. This talk tells you what you need to know to exploit Fast Data successfully.

Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams Tutorial

This hands-on tutorial builds streaming apps as microservices using Kafka with Akka Streams and Kafka Streams. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll be better informed when choosing tools for your projects. We'll contrast them with Spark Streaming and Flink, including when to choose them instead. The sample apps demonstrate ML model serving ideas.

Jason is a software engineer at Cloudera focusing on the cloud.

Presentations

Comparative Analysis of the Fundamentals of AWS and Azure Session

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies are faced with the difficult and daunting decision of which cloud to go with. This decision is not just financial; in many cases it rests on the underlying infrastructure. In this talk, we use our experience building production services on AWS and Azure to compare their strengths and weaknesses.

Jacob Ward is a science and technology correspondent for CNN, Al Jazeera, and PBS. The former editor-in-chief of Popular Science magazine, Ward writes for The New Yorker, Wired, and Men’s Health. His ten-episode Audible podcast, Complicated, discusses humanity’s most difficult problems, and he’s the host of an upcoming four-hour public television series, Hacking Your Mind, about human decision making and irrationality. Ward is developing a CNN original series about the unintended consequences of big ideas, and is a 2018-2019 Berggruen Fellow at Stanford University’s Center for Advanced Study in the Behavioral Sciences, where he’s writing a book, due for publication by Hachette Books in 2020, about how artificial intelligence will amplify good and bad human instincts.

Presentations

Black Box: How AI Will Amplify the Best and Worst of Humanity Keynote

For most of us, our own mind is a black box: an all-powerful and utterly mysterious device that runs our lives for us. And not only do we humans just barely understand how it works, science is now revealing that it makes most of our decisions for us using rules and shortcuts of which you and I aren’t even aware.

I have a research fellowship in physics at Harvard University, studying quantum metrology and quantum computing. My PhD research, in the field of quantum computing, was published in Nature and covered in the New York Times. I moved into artificial intelligence as the CEO of ASI Data Science because I believe it is the most exciting and important field of our time. ASI’s thesis is that the way to build the most valuable company, and add the most value to humanity, is to bring artificial intelligence to the real world: to schools, governments, businesses, and hospitals. We intend to pursue this mission for many years.

Presentations

Predicting residential occupancy and hot water usage from high frequency, multi-vector utilities data Session

Future home energy management systems could improve their energy efficiency by predicting resident needs from utilities data. This session discusses the opportunity, with a particular focus on the key data features, the need for data compression, and the data quality challenges.

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.

Katharina has a passion for travel and trying out new things, which has taken her from living in four countries to overnighting in the Moroccan desert, hiking to Machu Picchu, volunteering in Kenya, and becoming a certified yoga teacher in Bali. As head of data analytics and performance marketing at EveryMundo, Katharina works with airlines around the world on innovating based on actionable insights from analytics. She is responsible for defining data standards, specifying data collection systems, implementing the tracking environment, analyzing data quality, and supporting airlines’ digital strategy by leveraging automation and data science. Everything she does is multilingual, as she speaks German, French, Spanish, and English. Her long-term intention is to join the Data for Good movement, which encourages using data in meaningful ways to solve humanitarian issues around poverty, health, human rights, education, and the environment.

Presentations

Self-reliant, secure, end-to-end data, activity, and revenue analytics - Roadmap for the airline industry Data Case Studies

Airlines want to know what happens after a user interacts with our technology on their websites. Do users convert? Do they close the browser and come back later? Having previously depended on airlines’ analytics tools to prove value, Katharina explores how to implement a client-independent end-to-end tracking system.

Robin Way is a faculty member for banking at the International Institute of Analytics and the founder and president of the management analytics consultancy Corios. He has over 25 years of experience in the design, development, execution, and improvement of applied analytics models for clients in the credit, payments, lending, brokerage, insurance, and energy industries. Robin previously spent 12 years as a managing analytics consultant in SAS Institute’s financial services business unit, in addition to another 10+ years in analytics management roles at client-side and consulting firms.

Robin is the author of Skate Where the Puck’s Headed: A Playbook for Scoring Big with Predictive Analytics. He lives in Portland, Oregon, with his wife, Melissa, and two sons, Colin and Liam. In his spare time, Robin plays soccer and holds a black belt in taekwondo.

Robin’s professional passion is devoted to democratizing and demystifying the science of applied analytics. His contributions to the field correspondingly emphasize statistical visualization, analytical data preparation, predictive modeling, time series forecasting, mathematical optimization applied to marketing, and risk management strategies. Robin’s undergraduate degree from the University of California, Berkeley, and his subsequent graduate-level coursework emphasized the analytical modeling of human and consumer behavior.

Presentations

Leading Next Best Offer Strategies for Financial Services Findata

This session presents case study examples of next-best-offer strategies, predictive customer journey analytics, and behavior-driven time-to-event targeting for mathematically optimal customer messaging that drives incremental margins.

Chief Data Officer of Goldman Sachs

Presentations

Keynote with Jeffrey Wecker Keynote

Jeffrey Wecker, Chief Data Officer, Goldman Sachs

Daniel Weeks manages the Big Data Compute team at Netflix and is a Parquet committer. Prior to joining Netflix, Daniel focused on research in big data solutions and distributed systems.

Presentations

The evolution of Netflix's S3 data warehouse Session

In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. This talk summarizes what we've learned, the tools we currently use and those we've retired, and the improvements we're rolling out, including Iceberg, a new table format for S3.

Thomas is a software engineer on the streaming platform team at Lyft and the Apache Apex PMC chair. Earlier, he worked at a number of other technology companies in the San Francisco Bay Area, including DataTorrent, where he was a cofounder of the Apex project. Thomas is also a committer to Apache Beam and has contributed to several other ecosystem projects. He has worked on distributed systems for over 20 years, speaks at international big data conferences, and is the author of the book Learning Apache Apex.

Presentations

Near-real time Anomaly Detection at Lyft Session

Consumer-facing real-time processing poses a number of challenges in protecting against fraudulent transactions and other risks. The streaming platform at Lyft seeks to support this with an architecture that brings together a data science-friendly programming environment and a deployment stack that meets the reliability, scalability, and other SLA requirements of a mission-critical stream processing system.

Mike Wendt is an engineering manager in the AI infrastructure group at NVIDIA. His research work has focused on leveraging GPUs for big data analytics, data visualization, and stream processing. Prior to joining NVIDIA, Mike led engineering work on big data technologies like Hadoop, DataStax Cassandra, Storm, Spark, and others. In addition, Mike has focused on developing new ways of visualizing data and the scalable architectures to support them. Mike holds a BS in computer engineering from the University of Maryland.

Presentations

Accelerating Financial Data Science Workflows With GPU Session

GPUs have allowed financial firms to run complex simulations, train myriad models, and mine data at unparalleled speeds. Today, the bottleneck has moved completely to ETL. With the GPU Open Analytics Initiative (GoAi), we’re accelerating ETL and keeping the entire workflow on GPUs. We’ll discuss real-world examples, benchmarks, and how we’re accelerating our largest FS customers.

Masha completed her PhD in Cognitive Science at New York University in 2015. She is now Director of the Investopedia Data Science team, where she works to answer questions such as “What can Investopedia’s readership tell us about current market sentiment?” and “What financial concepts are most interesting to American investors, from Wall Street to Silicon Valley?”

Presentations

Anxiety at scale: How Investopedia used readership data to track market volatility Session

As our businesses rely more heavily on user data to power our sites, products, and sales, can we give back by sharing those insights with users? Learn how Investopedia harnessed reader data to build an index that tracks market anxiety and moves with the VIX, a proprietary measure of market volatility. We’ll focus on thinking outside the box to turn data into tools for users, not just stakeholders.

Hee Sun Won is a principal researcher at the Electronic and Telecommunications Research Institute (ETRI) and leads the Collaborative Analytics Platform for BDaaS (big data as a service) and analytics for the Network Management System (NFV/SDN/cloud). Her research interests include multitenant systems, cloud resource management, and big data analysis.

Presentations

A Data Marketplace Case Study with Blockchain and Advanced Multitenant Hadoop in a Smart Open Data Platform Session

This session addresses how analytics services in data marketplace systems can be performed on a single Hadoop cluster spanning distributed data centers. We extend the overall architecture of the Hadoop ecosystem with blockchain so that multiple tenants and authorized third parties can securely access data to perform various analytics while still maintaining privacy, scalability, and reliability.

Brian has been an engineer on the AppNexus optimization team for five years. During his tenure at AppNexus, Brian has worked closely with budgeting, valuation, and allocation systems and has seen great changes and great mistakes. Coming from a pure mathematics background, Brian enjoys working on algorithm, logic, and streaming data problems with his team. In addition to control systems, data technologies, and real-time applications, Brian loves talking about process, teamwork, management, sequencers, synthesizers, and the NYC music scene.

Presentations

AppNexus's Stream-based Control System for Automated Buying of Digital Ads Session

Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. This talk describes the evolution of Inventory Discovery, a streaming control system for eligibility, prioritization, and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus.

Tony Wu manages the Altus core engineering team at Cloudera. Previously, Tony was the team lead of the partner engineering team at Cloudera and was responsible for Microsoft Azure integration for Cloudera Director.

Presentations

Comparative Analysis of the Fundamentals of AWS and Azure Session

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies are faced with the difficult and daunting decision of which cloud to go with. This decision is not just financial; in many cases it rests on the underlying infrastructure. In this talk, we use our experience building production services on AWS and Azure to compare their strengths and weaknesses.

Running multidisciplinary big data workloads in the cloud Tutorial

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.

Jerry Xu is cofounder and CTO at Datatron Technologies. An innovative software engineer with extensive programming and design experience in storage systems, online services, mobile, distributed systems, virtualization, and OS kernels, Jerry has a demonstrated ability to direct and motivate a team of software engineers to complete projects to specification and on deadline. Previously, he worked at Zynga, Twitter, Box, and Lyft, where he built the company’s ETA machine learning model. Jerry is the author of the open source project LibCrunch and a three-time Microsoft Gold Star Award winner.

Presentations

Infrastructure for deploying machine learning to production: lessons and best practices in large financial institutions Session

Large financial institutions have many data science teams (e.g., fraud, credit risk, marketing), often using a diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. This talk covers challenges and lessons learned deploying AI models to production in large financial institutions.

Invented new ways to recognize people by the way they move in the early days; now the main focus is applying machine learning techniques to solve various problems in daily life.

Presentations

When Tiramisu Meets Online Fashion Retail Session

Large online fashion retailers face the problem of efficiently maintaining a catalogue of millions of items. Due to human error, it is not unusual for some items to have duplicate entries. Trawling through such a large catalogue manually is nearly impossible. How would you prevent such errors? Find out how we applied deep learning as part of the solution.

Longqi Yang is a fourth-year PhD candidate in computer science at Cornell Tech and Cornell University. He is a member of the Connected Experiences Lab and the Small Data Lab, where he is advised by Prof. Deborah Estrin. His current research focuses on building frameworks, systems, and algorithms that bring deeper user and content understanding into recommendations, and his work has been published and presented at top academic conferences such as WWW, WSDM, and CIKM. He was the organizer of and a speaker at the 2017 workshop “Immersive Recommendation: Deep User and Content Modeling for Personalization” at the NYC Media Lab annual summit.

Presentations

Harnessing and Customizing State-of-the-art Recommendation Solutions with OpenRec Session

State-of-the-art recommendation algorithms are increasingly complex and no longer one-size-fits-all. Current monolithic development practice poses significant challenges to rapid, iterative, and systematic experimentation. This talk demonstrates how researchers and practitioners can use OpenRec, an open source framework, to easily customize state-of-the-art solutions for diverse scenarios.

Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Nir Yungster leads the data science team at JW Player, which builds recommendation engines as a service for online video publishers. He holds a bachelor's degree in aerospace engineering from Princeton and a master's in applied mathematics from Northwestern University.

Presentations

Building Turn-key Recommendations for 5% of Internet Video Session

Building a video recommendation model that serves millions of monthly visitors is a challenge in itself. At JW Player, we face the challenge of providing on-demand recommendations as a service to thousands of media publishers. We focus on how to systematically improve model performance while navigating the many engineering challenges and unique needs of the diverse publishers we serve.

Varant Zanoyan is a software engineer on the ML Infrastructure team at Airbnb where he works on tools and frameworks for building and productionizing ML models. Before Airbnb, he worked on solving data infrastructure problems at Palantir Technologies.

Presentations

Zipline - Airbnb's Data Management Platform for Machine Learning Session

Zipline is Airbnb's soon-to-be-open-sourced data management platform designed specifically for ML use cases. It has reduced the task of feature generation from months to days and offers features to support end-to-end data management for machine learning. This talk covers Zipline's architecture and dives into how it solves ML-specific problems.

Xiaohan Zeng is a software engineer on the machine learning infrastructure team at Airbnb. He majored in chemical engineering at Tsinghua University and Northwestern University but began pursuing a career in software engineering and machine learning after doing research in data science. Prior to joining Airbnb, he worked on the machine learning platform team at Groupon for three years. Outside work, he enjoys reading, writing, traveling, movies, and trying to follow his daughter around when she suddenly decides to practice walking.

Presentations

Bighead: Airbnb's End-to-End Machine Learning Platform Session

We introduce Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Bighead integrates popular libraries including TensorFlow, XGBoost, and PyTorch. It is built on Python, Spark, and Kubernetes and is designed to be used in modular pieces. It has reduced overall model development time at Airbnb from many months to days.

Wenjing Zhan is a data scientist at Talroo in charge of predictive machine learning. Previously, she aided search relevance through classification modeling. Wenjing has experience with data engineering on Apache Spark and machine learning in Scala, R, and Python. She holds a master's in statistics from the University of Texas at Austin.

Presentations

Job recommendations leveraging Deep Learning on Apache Spark with BigDL Session

Can the talent industry make job search and matching more relevant and personalized for candidates by leveraging deep learning techniques? In this session, we demonstrate how to leverage BigDL (a distributed deep learning framework for Apache Spark) to predict a candidate's probability of applying to specific jobs based on their resume.

Mang Zhang is a big data platform development engineer at JD.com, where he is mainly engaged in the construction and development of the company's big data platform using open source projects such as Hadoop, Spark, Hive, Alluxio, and Presto. He focuses on the big data ecosystem and is an open source developer and a contributor to Alluxio, Hadoop, Hive, and Presto.

Presentations

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Session

Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, the JDPresto framework has seen a 10x performance improvement on average.

Currently leads an engineering team providing big data services (HDFS, YARN, Spark, TensorFlow, and beyond) that power LinkedIn's business intelligence and relevance applications.

Apache Hadoop PMC member; led the design and development of HDFS Erasure Coding (HDFS-EC).

Presentations

TonY -- Native support of TensorFlow on Hadoop Session

We have developed TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running distributed TensorFlow training as a new type of Hadoop application. TonY's native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop, including MapReduce and Spark.

Xiaoyong Zhu is a program manager in Microsoft's Cloud AI group. His current focus is building scalable deep learning algorithms.

Presentations

Deep Learning on audio in Azure to detect sounds in real-time Session

The human brain processes and reacts effortlessly to a variety of sounds in the auditory world (dog barks, alarms, people calling from behind, etc.). Most of us take this for granted, yet there are over 360 million people in the world who are deaf or hard of hearing. How can we make the auditory world inclusive, and meet the great demand in other sectors, by applying deep learning on audio in Azure?