Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Speakers

Experts and innovators from around the world share their insights and best practices. New speakers are added regularly. Please check back to see the latest updates to the agenda.

Mohamed AbdelHady is a senior data scientist on the algorithms and data science (ADS) team within the AI+R Group at Microsoft, where he focuses on machine learning applications for text analytics and natural language processing. Mohamed works with Microsoft product teams and external customers to deliver advanced technologies that extract useful and actionable insights from unstructured free text such as search queries, social network messages, product reviews, and customer feedback. Previously, he spent three years at Microsoft Research’s Advanced Technology Labs. He holds a PhD in machine learning from the University of Ulm in Germany.

Presentations

Deep learning for domain-specific entity extraction from unstructured text Session

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.

Dave Abercrombie is a senior staff engineer at Sharethrough, where his database ingests dozens of terabytes of semistructured data daily to maintain a petabyte database with near-perfect referential integrity. Dave approaches BI from a database perspective. His specialities are data integrity, ETL, robust dimensional design, and both logical and physical database design. He has two decades of experience in database engineering, with the last six years focused on business intelligence on very large databases.

Presentations

The Snowflake data warehouse: How Sharethrough analyzes petabytes of event data in a SQL database (sponsored by Snowflake) Session

Dave Abercrombie explains how Sharethrough used Snowflake to build an analytic and reporting platform that handles petabyte-scale data with ease.

Anthony Accardo is director of applied R&D and advanced development for media networks production, distribution, marketing, analytics, and digital media at Disney. His areas of focus include data, metadata, machine learning and artificial intelligence, digital media products, UI and UX, animation technologies, video operations, content development, content research, data science, game engines, AR, and VR.

Presentations

How content data can support content intelligence Media and Ad Tech

Anthony Accardo details how his team at Disney creates rich metadata describing episodic television content and leverages it for content intelligence across multiple points in the content lifecycle.

Vijay Srinivas Agneeswaran is a senior director of technology at Publicis Sapient. Vijay has spent the last 12 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions.

His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies.

Vijay holds a bachelor’s degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.

Presentations

Achieving GDPR compliance and data privacy using blockchain technology Session

Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).

Ask Me Anything: Deep learning-based search and recommendation systems using TensorFlow Session

Join Vijay Srinivas Agneeswaran and Abhishek Kumar to discuss recommender systems—particularly deep learning-based recommender systems in TensorFlow—or ask any other questions you have about deep learning.

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to use deep learning to build a recommender system based on intent prediction, drawing on a real-world implementation for an ecommerce client.

John Mark Agosta is a principal data scientist at Microsoft, where he leads a team that is expanding the machine learning and artificial intelligence capabilities of Azure. Previously, John worked with startups and labs in the Bay Area, including “The Connected Car 2025” at Toyota ITC, peer-to-peer malware detection at Intel, and automated planning at SRI. His dedication to probability and AI led him to found an annual applications workshop for the Uncertainty in AI conference. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Presentations

Distributed clinical models: Inference without sharing patient data Session

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one trained on the entire dataset.

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Ritesh Agrawal leads the intelligent infrastructure systems team at Uber, which focuses on scaling data infrastructure for Uber’s current and foreseeable business needs. Ritesh is a leading data scientist for optimizing infrastructure. Previously, he specialized in predictive and ranking models at Netflix, AT&T Labs, and Yellow Pages, where he built scalable machine learning infrastructure with technologies such as Docker, Hadoop, and Spark. He holds a PhD in environmental earth science from Pennsylvania State University, where his thesis focused on computational tools and technologies such as concept map ontologies.

Presentations

Presto query gate: Identifying and stopping rogue queries Session

Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resources and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.

Adam Ahringer is a software development manager on the data platform team at Disney-ABC TV Digital Media, where he works on the platform that supports ABC’s streaming applications. His experience includes systems engineering, database administration, software development, and technical architecture. He has called Seattle home for many years but grew up in the complete opposite corner of the US—Miami.

Presentations

Analytics in real time, the (Grey's) anatomy of event streaming (sponsored by MemSQL) Session

Adam Ahringer explains how Disney-ABC TV leverages Amazon Kinesis and MemSQL to provide real-time insights based on user telemetry as well as the platform for traditional data warehousing activities.

Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads technical infrastructure internal data processing teams for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging between the two. He is the author of the 2015 “Dataflow Model” paper and “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Foundations of streaming SQL; or, How I learned to love stream and table theory Session

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.

Sridhar Alla is director of data science and engineering at Comcast. A big data expert, Sridhar has helped companies large and small solve complex problems such as data warehousing, governance, security, real-time processing, high-frequency trading, and establishing large-scale data science practices. Previously, he was the chief technology officer at cybersecurity firm eIQNetworks and a storage software engineer at Network Appliance. Sridhar is a certified Agile DevOps practitioner and implementer. He is an avid presenter at conferences including Strata + Hadoop World and Spark Summit. Sridhar also provides onsite and online training for several technologies. He has several patents filed with the USPTO on large-scale computing and distributed systems. Sridhar holds a bachelor’s degree in computer science from JNTU in Hyderabad, India. He lives with his wife in New Jersey.

Presentations

Improving the customer experience via Click Through Analytics Media and Ad Tech

Personalization is becoming prevalent on all sorts of interactive platforms and is increasingly used as an effective method to enhance the customer experience on the platform. Nowadays, set-top boxes are more advanced than ever, with a lot of interactive, personalizable content.

Improving the customer experience via clickthru analytics Media and Ad Tech

Sridhar Alla explores technologies and methodologies to gain insights into customer experience in order to understand what content works best and explains how to personalize content to enhance customer experience.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Executive Briefing: What does an exec need to know about architecture and why Session

There's been an explosion of new architectures, but is this because engineers love new things or is there a good business reason for these changes? Jesse Anderson explores new architectures and the actual business problems they solve. You may find out that your team would be far more productive if you moved to these architectures.

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Real-time systems with Spark Streaming and Kafka (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

André Araujo is a principal solutions architect at Cloudera. An experienced consultant with a deep understanding of the Hadoop stack and its components and a methodical and keen troubleshooter who loves making things run faster, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs.

Presentations

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments Tutorial

New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, André Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR.

Michael Armbrust is the lead developer of the Spark SQL and Structured Streaming projects at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael holds a PhD from UC Berkeley, where his thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.

Presentations

Streaming big data in the cloud: What to consider and why Session

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.

Amr Awadallah is the cofounder and CTO at Cloudera. Previously, Amr was an entrepreneur in residence at Accel Partners, served as vice president of product intelligence engineering at Yahoo, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr’s first startup, VivaSmart, was acquired by Yahoo in July 2000. Amr holds bachelor’s and master’s degrees in electrical engineering from Cairo University, Egypt, and a PhD in electrical engineering from Stanford University.

Presentations

Automating decisions with data in the cloud Keynote

Amr Awadallah explains why the cloud requires a different approach to machine learning and analytics and what you can do about it.

Shivnath Babu is an associate professor of computer science at Duke University, where his research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. He is also the chief scientist at Unravel Data Systems, the company he cofounded to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has received a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award. He has given talks and distinguished lectures at many research conferences and universities worldwide. Shivnath has also spoken at industry conferences, such as the Hadoop Summit.

Presentations

Using machine learning to simplify Kafka operations Session

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.

Dorna Bandari is the director of algorithms at AI-driven prediction platform Jetlore, where she leads development of large-scale machine learning models and machine learning infrastructure. Previously, she was a lead data scientist at Pinterest and the founder of ML startup Penda. Dorna holds a PhD in electrical engineering from UCLA.

Presentations

Building a flexible ML pipeline at a B2B AI startup Session

Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.

Burcu Baran is a senior data scientist on LinkedIn’s analytics data mining team. Burcu is passionate about bringing mathematical solutions to business problems using machine learning techniques. Previously, she worked on predictive modeling at a B2B business intelligence company and was a postdoc in the Mathematics Departments at both Stanford and the University of Michigan. Burcu holds a PhD in number theory.

Presentations

Ask Me Anything: Big data and machine learning techniques to drive and grow business Session

Join Burcu Baran and Wei Di to discuss big data in business analytics, machine learning in business analytics, and achieving actionable insights from big data.

Big data analytics and machine learning techniques to drive and grow business Tutorial

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Previously, Roger was in the Cloud Machine Learning Group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.

Presentations

Continuous machine learning over streaming data Session

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.

Stephanie Beben is a chief technologist at Booz Allen Hamilton specializing in data science and big data technology consulting with a passion for mentoring, leadership, and team building. She possesses over a decade of research and development experience deriving value from massive datasets, utilizing tools such as Hadoop, Python, and Splunk. Her data science expertise includes data mining, machine learning, rapid prototypes, and data visualization techniques across markets including cyber and mobile technologies, healthcare, sports, and energy. Stephanie currently applies this expertise as a leader in Booz Allen’s Strategic Innovation Group, consulting on technical strategy and growth of data science capabilities and teams within large organizations.

Presentations

The mathematical corporation: A new leadership mindset for the machine intelligence era Session

How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Stephanie Beben shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.”

James Bednar is a senior solutions architect at Anaconda. Previously, Jim was a lecturer and researcher in computational neuroscience at the University of Edinburgh, Scotland, and a software and hardware engineer at National Instruments. He manages the open source Python projects datashader, HoloViews, GeoViews, ImaGen, and Param. He has published more than 50 papers and books about the visual system, data visualization, and software development. Jim holds a PhD in computer science from the University of Texas as well as degrees in electrical engineering and philosophy.

Presentations

Custom interactive visualizations and dashboards for one billion datapoints on a laptop in 30 lines of Python Tutorial

Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code.

Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services, where he focuses on AI and real-time streaming technologies and working with AWS customers to build data-driven products (whether batch or real time) and create solutions powered by ML in the cloud. Roy has worked in the data and analytics industry for over a decade and has helped hundreds of customers bring compelling data-driven products to the market. He serves on the advisory board of Applied Mathematics and Data Science at Post University in Connecticut. Roy holds a BSc in information systems and an MBA from the University of Georgia.

Presentations

The real-time journey from raw streaming data to AI-based analytics Session

Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data to metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution.

Valentin “Val” Bercovici is founder and CEO at PencilDATA, democratizing trust throughout digital transformation. Val is also cofounder and a senior advisor at Peritus.ai, a company focused on completing the autonomous data center vision by addressing the gap in automated tech support via machine learning. He was a founding member of the governing board at the Cloud Native Computing Foundation (CNCF), the Linux Foundation’s home for Google’s Kubernetes, the Open Container Initiative (OCI), and many other related cloud-native projects. Val has enjoyed a long leadership career. Previously, at NetApp/SolidFire, he launched multibillion-dollar storage and compliance products, created the competitive team and strategy, directed new research investments for the NetApp Data Fabric roadmap, and served as SolidFire’s CTO. A pioneer in the cloud industry, Val led the creation of NetApp’s cloud strategy and introduced the first international cloud standard to the marketplace as CDMI (ISO INCITS 17826) in 2012. Val advises numerous data-driven startups and is passionate about improving diversity within the tech industry. He has several patents issued and pending around data center applications of augmented reality and data authenticity.

Presentations

Supply chain evolution from horseless buggies to driverless cars Data Case Studies

Valentin Bercovici explores the challenges in securing, maintaining, and repairing the dynamic, heterogeneous software supply chain for modern self-driving cars, from levels 0 to 5. Along the way, Valentin reviews implementation options, from centralized certificate authority-based architectures to decentralized blockchains networked over a fleet of cars.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences internationally and in the United States. He is the copresenter of various O’Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Stream processing with Kafka Tutorial

Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.

Brian Bloechle is an industrial mathematician and data scientist as well as a technical instructor at Cloudera.

Presentations

Data science and machine learning with Apache Spark 2-Day Training

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Data science and machine learning with Apache Spark (Day 2) Training Day 2

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Ron Bodkin is a technical director on the applied artificial intelligence team at Google, where he provides leadership for AI success for customers in Google’s Cloud CTO office. Ron engages deeply with Global F500 enterprises to unlock strategic value with AI, acts as executive sponsor with Google product and engineering to deliver value from AI solutions, and leads strategic initiatives working with customers and partners. Previously, Ron was the founding CEO of Think Big Analytics, a company that provides end-to-end support for enterprise big data, including data science, data engineering, advisory, and managed services and frameworks such as Kylo for enterprise data lakes. When Think Big was acquired by Teradata, Ron led global growth, the development of the Kylo open source data lake framework, and the company’s expansion to architecture consulting; he also helped create Teradata’s artificial intelligence incubator.

Presentations

Deploying deep learning with TensorFlow Tutorial

TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

San Francisco-based software engineer, authNZ geek, data geek, and graph geek Ryan Boyd is director of developer relations for Neo4j, an open source graph database that powers connected data analysis in data journalism, cancer research, and some of the world’s top companies. Previously, he was head of developer relations for Google Cloud Platform and worked on more than 20 different APIs and developer products during his eight years at Google. Ryan is the author of Getting Started with OAuth 2.0 by O’Reilly. Now that he has a young daughter, he no longer skydives but still enjoys the adventures of sailing and cycling.

Presentations

Graph analysis of 200,000 tweets from Russian Twitter trolls Session

Ryan Boyd explains how he and his team reconstructed a subset of the Twitter network of Russian troll accounts and applied graph analytics to the data using the Neo4j graph database to uncover how these accounts were spreading fake news.

David Boyle leads strategy and insight at MasterClass, where he works with the likes of Stephen Curry, Gordon Ramsay, and Martin Scorsese to help people around the world learn from the greatest in their field. Passionate about helping businesses build analytics-driven decision making to make quicker, smarter, and bolder decisions, David has built global analytics and insight capabilities for a number of the leading entertainment businesses in the world, including television (the BBC), book publishing (HarperCollins Publishers), and the music industry (EMI Music), that helped shift each organization’s decision making at all levels, from content investment to product and brand development. His other pursuits have included building analytics for global retailers and political campaigns in the US and UK.

Presentations

The golden age of data and analytics Media and Ad Tech

David Boyle argues that we are approaching a golden age of data and analytics in the media and entertainment industries and highlights some of the fundamental questions and challenges that need to be overcome to reach the true potential of data and analytics.

Who has better taste, machines or humans? Media and Ad Tech

Algorithms decide what we see, what we listen to, what news we consume, and myriad other decisions each day. But while they can make many things more efficient, can they outperform humans in areas where the "right" outcome can't be clearly defined? In this Oxford-style debate, two teams will face off, arguing whether or not machines have better taste than humans.

Katherine Boyle is a principal on the investment team at General Catalyst. She focuses on frontier technologies and companies in highly regulated sectors, including healthcare, computational biology, defense, aerospace, and financial technology. Before entering venture capital, she was a general assignment reporter at The Washington Post covering creative industries, government accountability, and retail. In 2016, she received an MBA from Stanford Graduate School of Business, where she was a research assistant to Dr. Condoleezza Rice for her course and upcoming book, “Managing Political Risk.” Katherine is a graduate of Georgetown University and holds a master’s degree in public advocacy from the National University of Ireland, Galway.

Presentations

Make data work: A VC panel discussion on prospectives and trends Session

To anticipate who will succeed and to invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, top-tier VCs look over the horizon to consider the big trends in how data is being put to work in startups and share what they think the field will look like in a few years (or more).

Fidan Boylu Uz is a senior data scientist on the algorithms and data science team at Microsoft, where she is responsible for successful delivery of end-to-end advanced analytics solutions. Fidan has 10+ years of technical experience in machine learning and business intelligence and has worked on projects in multiple domains such as predictive maintenance, fraud detection, mathematical optimization, and deep learning. She is a former professor at the University of Connecticut, where she conducted research and taught courses on machine learning theory and its business applications. She has authored a number of academic publications in the areas of machine learning and optimization. Fidan holds a PhD in decision sciences.

Presentations

Operationalize deep learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios Session

Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.

Joseph Bradley is a software engineer working on machine learning at Databricks. Joseph is an Apache Spark committer and PMC member. Previously, he was a postdoc at UC Berkeley. Joseph holds a PhD in machine learning from Carnegie Mellon University, where he focused on scalable learning for probabilistic graphical models, examining trade-offs between computation, statistical efficiency, and parallelization.

Presentations

Best practices for productionizing Apache Spark MLlib models Session

Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.

Claudiu Branzan is the vice president of data science and engineering at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Kurt Brown leads the data platform team at Netflix, which architects and manages the technical infrastructure underpinning the company’s analytics, including various big data technologies like Hadoop, Spark, and Presto; Netflix open-sourced applications and services such as Genie and Lipstick; and traditional BI tools including Tableau and Redshift.

Presentations

20 Netflix-style principles and practices to get the most out of your data platform Session

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.

Anne Buff is a business solutions manager and thought leader for SAS Best Practices, a thought leadership organization within the SAS institute, where she leverages her training and consulting experience and her data savviness to lead best practices workshops and facilitate intrateam dialogues to help companies realize their full data and analytics potential. As a speaker and author, Anne specializes in analytic strategy and culture, governance, change management, and fostering data-driven organizations. She has been a specialist in the world of data and analytics for almost 20 years and has developed courseware for a wide range of technical concepts and software, including SAS Data Management.

Presentations

Progressive data governance for emerging technologies Session

Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement.

Noah Burbank is a software engineer on Salesforce’s intelligence services team, where he focuses on the application of artificial intelligence to improve the quality of decisions that his customers can make every day in their businesses. He holds a PhD in decision and risk analysis from Stanford University, where his research simplified complex decision-making techniques for application in everyday life.

Presentations

Building a contacts graph from activity data Session

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph that models contextual data.

Tobias Bürger leads the Platform and Architecture Group within the Big Data, Machine Learning, and Artificial Intelligence Department at BMW Group, where he is responsible for the global big data platform that is the core technical pillar of the BMW data lake and is used across different divisions inside the BMW Group, spanning areas such as production, aftersales, and ConnectedDrive.

Presentations

Data-driven ecosystems in the automotive industry Session

The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside of the organization. Josef Viehhauser and Tobias Bürger discuss the E-to-E relationship of data and models and share best practices for scaling applications in real-world environments.

Yuri Bykov is director of data science at Dice.com, where he and his team leverage machine learning, NLP, big data, information retrieval, and other scientific disciplines to research and build innovative data products and services that help tech professionals manage their careers. Yuri started his career as a software developer, moving into BI and data analytics before finding his passion in data science. He holds an MBA and MIS from the University of Iowa.

Presentations

Building career advisory tools for the tech sector using machine learning Session

Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production.

Henry Cai is a software engineer on the data engineering team at Pinterest, where he designs large-scale big data infrastructures. Previously, he worked at LinkedIn. Henry is a maintainer of and contributor to many open source data ingestion systems, including Camus, Kafka, Gobblin, and Secor.

Presentations

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously Session

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

James Campbell is a senior data scientist and researcher at the Laboratory for Analytical Sciences (LAS), a collaborative public-private research and development organization housed at NC State University. His current work focuses on measuring and enhancing analytic quality by weaving together traditional, human-centric analytic processes with predictive, model-driven analytic tools. He is one of the core contributors to the Great Expectations project. James has worked in government for more than a decade, leading significant data science tradecraft development efforts. He has managed multiple data science teams tackling a wide range of topics, including counterterrorism and information operations. His prior analytical experience includes strategic cyberthreat intelligence research and economic analysis for litigation. James holds a bachelor’s degree in math and philosophy from Yale and a master’s degree in security studies from Georgetown. James lives in Cary, North Carolina, with his wife, two daughters, and dog. He speaks Russian, enjoys running and cycling, and designs mathematical sculpture.

Presentations

Pipeline testing with Great Expectations Session

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.

Yishay Carmiel is the founder of IntelligentWire, a company that develops and implements industry-leading deep learning and AI technologies for automatic speech recognition (ASR), natural language processing (NLP), and advanced voice data extraction, and the head of Spoken Labs, the strategic artificial intelligence and machine learning research arm of Spoken Communications. Yishay and his teams are currently working on bleeding-edge innovations that make the real-time customer experience a reality—at scale. Yishay has nearly 20 years’ experience as an algorithm scientist and technology leader building large-scale machine learning algorithms and serving as a deep learning expert.

Presentations

Executive Briefing: The conversational AI revolution Session

One of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean and to take particular actions based on that information. This goal is the essence of conversational AI. Yishay Carmiel explores the latest breakthroughs and revolutions in this field and the challenges still to come.

Michelle Casbon is a senior engineer on the Google Cloud Platform developer relations team, where she focuses on open source contributions and community engagement for machine learning and big data tools. Michelle’s development experience spans more than a decade and has primarily focused on multilingual natural language processing, system architecture and integration, and continuous delivery pipelines for machine learning applications. Previously, she was a senior engineer and director of data science at several San Francisco-based startups, building and shipping machine learning products on distributed platforms using both AWS and GCP. She especially loves working with open source projects and is a contributor to Kubeflow. Michelle holds a master’s degree from the University of Cambridge.

Presentations

Continuous delivery for NLP on Kubernetes: Lessons learned Session

Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime.

William Chambers is a product manager at Databricks, where he works on Structured Streaming and data science products. He is lead author of Spark: The Definitive Guide, coauthored with Matei Zaharia. Bill also created SparkTutorials.net as a way to teach Apache Spark basics. Bill holds a master’s degree in information management and systems from UC Berkeley’s School of Information. During his time at school, Bill was also creator of the Data Analysis in Python with pandas course for Udemy and cocreator of and first instructor for Python for Data Science, part of UC Berkeley’s Masters of Data Science program.

Presentations

Streaming big data in the cloud: What to consider and why Session

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.

Rachita Chandra is a solutions architect at IBM Watson Health, where she brings together end-to-end machine learning solutions in healthcare. She has experience implementing large-scale, distributed machine learning algorithms. Rachita holds both a master’s and bachelor’s degree in electrical and computer engineering from Carnegie Mellon.

Presentations

Transforming a machine learning prototype to a deployable solution leveraging Spark in healthcare Session

Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment.

Chris Chapo is vice president of customer data and analytics within the customer and strategy team at Gap Inc., where he helps the company increase its ability to best take advantage of quantitative information about customers and ensures all brands take a data-driven approach in transforming the company to be customer obsessed. Chris has extensive experience building data science organizations, teams, and platforms and applying statistical and analytic rigor to a variety of functions, including marketing, customer experience, loyalty, and customer service and support. Previously, he led data teams for a wide variety of companies, including Apple, Intuit, JCPenney, and Enjoy.

Presentations

Lessons on driving data science and analytics transformation Session

Chris Chapo walks you through real-world examples of companies that are driving transformational change by leveraging data science and analytics, paying particular attention to established organizations where these capabilities are newer concepts.

Anny (Yunzhu) Chen is a senior data scientist at Uber working on time series anomaly detection and forecasting. Anny is passionate about applying statistical and machine learning models to real business problems. Previously, she was a data scientist at Adobe, where she worked on digital attribution modeling for customer conversion data. She holds an MS in statistics from Stanford University and a BS in probability and statistics from Peking University.

Presentations

Detecting time series anomalies at Uber scale with recurrent neural networks Session

Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.

April Chen is a lead data scientist on the R&D team at Civis Analytics, where she develops software to automate statistical modeling workflows to help organizations from Fortune 500 companies to nonprofits understand and leverage their data. April’s background is in economics. Previously, she worked as an analytics consultant.

Presentations

Show me the money: Understanding causality for ad attribution Media and Ad Tech

Which of your ad campaigns lead to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. April Chen details the shortcomings of these models and proposes a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.

Shuyi Chen is a senior software engineer at Uber working on building scalable real-time data solutions. He built Uber’s real-time complex event processing platform for the marketplace, which powers 100+ production real-time use cases. Currently, he is the tech lead of Uber’s stream processing platform team. Shuyi has years of experience in storage infrastructure, data infrastructure, and Android and iOS development at both Google and Uber.

Presentations

Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber Session

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.

Wei Ting Chen is a senior software engineer in Intel’s Software Service Group, where he works on big data on cloud solutions. One of his responsibilities is helping customers integrate big data solutions into their cloud infrastructure. Wei Ting is a contributor to the OpenStack Sahara project.

Presentations

Spark on Kubernetes: A case study from JD.com Session

Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.

Pramit Choudhary is a lead data scientist at DataScience.com, where he focuses on optimizing and applying classical machine learning and Bayesian design strategy to solve real-world problems. Currently, he is leading initiatives on figuring out better ways to explain a model’s learned decision policies to reduce the chaos in building effective models and close the gap between a prototype and operationalized model.

Presentations

Human in the loop: Bayesian rules enabling explainable AI Session

Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if. . .and else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation.

Meet the Expert with Pramit Choudhary (DataScience.com) Meet the Experts

Pramit will talk in detail about enabling explainable AI, including the need for better model interpretation, how to build interpretable machine learning systems using the open source framework Skater, and how Bayesian rule lists and other algorithms can assist human decision making.

Michael Chui is a San Francisco-based partner in the McKinsey Global Institute, where he directs research on the impact of disruptive technologies, such as big data, social media, and the internet of things, on business and the economy. Previously, as a McKinsey consultant, Michael served clients in the high-tech, media, and telecom industries on multiple topics. Prior to joining McKinsey, he was the first chief information officer of the City of Bloomington, Indiana, and was the founder and executive director of HoosierNet, a regional internet service provider. Michael is a frequent speaker at major global conferences and his research has been cited in leading publications around the world. He holds a BS in symbolic systems from Stanford University and a PhD in computer science and cognitive science and an MS in computer science, both from Indiana University.

Presentations

Executive Briefing: Artificial intelligence—The next digital frontier? Session

After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.

Garner Chung is the engineering manager of the human computation team and the data science team supporting core product, growth, and infrastructure at Pinterest. Previously, he managed the data science team at Opower, where he drove efforts to research and productionize predictive models for all of product and engineering. Many years ago, he studied film at UC Berkeley, where he learned to deconstruct and complicate misleadingly simple narratives. Over the course of his 20 years in the tech industry, he has witnessed exuberance over technology’s great promise ebb and flow, all the while remaining steadfast in his gratitude for having played some small part. As a leader, Garner has learned to drive teams that privilege responsibility and end-to-end ownership over arbitrary commitments.

Presentations

Humans versus the machines: Using human-based computation to improve machine learning Session

Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform.

Ira Cohen is a cofounder and chief data scientist at Anodot, where he is responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

The real-time journey from raw streaming data to AI-based analytics Session

Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing those metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution.

Eric Colson is chief algorithms officer at Stitch Fix, where he leads a team of 80+ data scientists and is responsible for the multitude of algorithms that are pervasive to nearly every function of the company, from merchandise, inventory, and marketing to forecasting and demand, operations, and the styling recommender system. He’s also an advisor to several big data startups. Previously, Eric was vice president of data science and engineering at Netflix. He holds a BA in economics from SFSU, an MS in information systems from GGU, and an MS in management science and engineering from Stanford.

Presentations

Differentiating via data science Keynote

While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization.

Matt Conners is a statistician and data sciences practitioner with extensive business operations and industry domain experience. He has over 20 years of financial technology experience across sales, marketing, business operations, securities, and banking. He has worked at Microsoft since 1995. He is a program manager in the Data Group working with customers, partners, and data scientists on financial forecasting solutions.

Presentations

How to successfully reinvent productivity in finance with machine learning (Hint: machine learning is only part of it.) Data Case Studies

Microsoft’s finance organization is reinventing forecasting using machine learning that its leaders describe as game changing. Matt Conners shares the lessons the data sciences and finance teams learned while bringing machine learning forecasting to the office of the CFO by improving forecast accuracy and frequency and driving cultural change through a finance center of excellence.

Mike Conover is an AI engineer at SkipFlag, where he builds machine learning technologies that leverage the behavior and relationships of hundreds of millions of people. Previously, Mike led news relevance research and development at LinkedIn. His work has appeared in the New York Times and the Wall Street Journal and on National Public Radio. Mike holds a PhD in complex systems analysis with a focus on information propagation in large-scale social networks.

Presentations

Fast and effective natural language understanding Session

Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

sparklyr, implyr, and more: dplyr interfaces to large-scale data Session

The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.

Dan Crankshaw is a PhD student in the CS Department at UC Berkeley, where he works in the RISELab. After cutting his teeth doing large-scale data analysis on cosmology simulation data and building systems for distributed graph analysis, Dan has turned his attention to machine learning systems. His current research interests include systems and techniques for serving and deploying machine learning, with a particular emphasis on low-latency and interactive applications.

Presentations

Deploying and monitoring interactive machine learning applications with Clipper Session

Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Opening remarks Tutorial

Strata Data Conference program chair Alistair Croll welcomes you to the Data Case Studies tutorial.

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Jonathan Crow is senior director of advanced analytics and business intelligence at Wargaming. An expert in predictive and prescriptive analytics, Jonathan and his team of data scientists analyze video game data and develop models for fraud/cheat detection, business model forecasting, targeted CRM, and marketing optimization to improve processes, player engagement, and revenue.

Presentations

Winning the big data war pays big dividends for Wargaming (sponsored by SAS) Session

Alexander Ryabov and Jonathan Crow explain how Wargaming is winning the battle for bigger profits in the virtual world of online gaming using a best-in-class business intelligence solution to equip its business units with decision-making tools.

Umur Cubukcu is cofounder and CEO of Citus Data, a leading Postgres company whose mission is to make it so companies never have to worry about scaling their relational database again. Focusing on both operations and strategy, Umur works directly with technical founders at SaaS companies to help them scale their multitenant applications and with enterprise leaders to power real-time apps that need to handle large-scale data. Umur’s team at Citus Data is active in the Postgres community, sharing expertise and contributing key components and extensions. Citus Data open-sourced its distributed database extension for PostgreSQL in early 2016. Umur has over 15 years of experience driving complex enterprise software, IT, and database initiatives at large enterprises and startups, and he has a deep interest in how scalable systems of record and systems of engagement can help businesses grow. He holds a master’s degree in management science and engineering from Stanford University.

Presentations

The state of Postgres Session

PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Mauro Damo is a senior data scientist at Dell, where he is responsible for helping organizations identify, develop, and implement analytical solutions in big data environments, focusing on solving business problems. He has developed and implemented analytical projects for a number of companies in a range of industries, including health care, mortgage insurance, financial brokers, cable companies, nongovernmental organizations, and supply chain. He has experience with a wide range of supervised and unsupervised models, including time series, graph analysis, optimization models, deep learning models (such as convolutional and recurrent neural networks), clustering, dimensionality reduction, tree algorithms, frequent pattern mining, ensemble models, Markov chains, and gradient descent. Mauro holds patents, has authored several papers, and speaks at international conferences, seminars, and classes. His main programming languages are Python, R, and SQL. He holds an MS in business, an MBA in finance, an undergraduate degree in business, and an associate degree in computer science.

Presentations

Bladder cancer diagnosis using deep learning Session

Using image recognition to classify diseases can minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identifying bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.

Anoop Dawar is vice president of product management and marketing at MapR Data Technologies. Anoop has more than a decade of experience leading product management and development teams at Cisco and Aerohive. His scientific approach to product management stems from his background in business and technology, as both a practitioner and a student. Anoop holds an MS degree in computer science from the University of Texas at Austin, where he was a TA for a graduate-level AI course, and an MBA from the Wharton School at the University of Pennsylvania, with a major in finance.

Presentations

The case for a deliberate data strategy in today’s attention-deficit economy (sponsored by MapR) Keynote

We are inundated with ideas and technology news in today’s data-rich but attention-deficit economy. In this environment, competitive advantage comes not from what is abundant (i.e., data) but from what is scarce—the ability to deploy insights in real time. Anoop Dawar explains how your peers are succeeding in shrinking the insight-to-action cycle and achieving great results.

Rahim Daya is head of search products at Pinterest. Previously, he led search and recommendation product teams at LinkedIn and Groupon.

Presentations

Personalization at scale: Mastering the challenges of personalization to create compelling user experiences Session

Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user.

Danielle Dean is a principal data scientist lead at Microsoft in the Algorithms and Data Science Group within the Artificial Intelligence and Research Division, where she leads a team of data scientists and engineers building predictive analytics and machine learning solutions with external companies utilizing Microsoft’s Cloud AI Platform. Previously, she was a data scientist at Nokia, where she produced business value and insights from big data through data mining and statistical modeling on data-driven projects that impacted a range of businesses, products, and initiatives. Danielle holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill, where she studied the application of multilevel event history models to understand the timing and processes leading to events between dyads within social networks.

Presentations

How does a big data professional get started with AI? Session

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI.

Jeff Dean is a Google senior fellow in Google’s Research Group, where he cofounded and leads the Google Brain team, Google’s deep learning and artificial intelligence research team. He and his collaborators are working on systems for speech recognition, computer vision, language understanding, and various other machine learning tasks. During his time at Google, Jeff has codesigned and implemented many generations of Google’s crawling, indexing, and query serving systems, major pieces of Google’s initial advertising and AdSense for content systems, and Google’s distributed computing infrastructure, including the MapReduce, BigTable and Spanner systems, protocol buffers, LevelDB, systems infrastructure for statistical machine translation, and a variety of internal and external libraries and developer tools. Jeff is a fellow of the ACM and the AAAS, a member of the US National Academy of Engineering, and a recipient of the ACM-Infosys Foundation Award in the Computing Sciences. He holds a PhD in computer science from the University of Washington, where he worked with Craig Chambers on whole-program optimization techniques for object-oriented languages, and a BS in computer science and economics from the University of Minnesota.

Presentations

Using deep learning to solve challenging problems Session

The Google Brain team conducts research on difficult problems in artificial intelligence and builds large-scale computer systems for machine learning research, both of which have been applied to dozens of Google products. Jeff Dean highlights some of Google Brain's projects with an eye toward how they can be used to solve challenging problems.

Anirban Deb is a data science manager at Uber. A seasoned data science and analytics leader, Anirban has extensive experience building and managing high-performing teams to support strategic decision making, business analytics, marketing analytics, product analytics, predictive modeling, reporting, and executive communication.

Presentations

Presto query gate: Identifying and stopping rogue queries Session

Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resources and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.

Alex Deng is a principal data scientist manager on Microsoft’s analysis and experimentation team, where he and his team work on methodological improvements of the experimentation platform as well as related engineering challenges. Alex has published his work in conference proceedings such as KDD, WWW, and WSDM and in statistical journals. He colectured a tutorial on A/B testing at JSM 2015. Alex holds a PhD in statistics from Stanford University and a BS in mathematics from Zhejiang University.

Presentations

A/B testing at scale: Accelerating software innovation Tutorial

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.

Matt Derda is a customer success manager at Trifacta. Previously, Matt was a CPFR (collaborative planning, forecasting, and replenishment) analyst at PepsiCo, where he worked with Trifacta to accelerate the preparation of customer supply chain data to more accurately and quickly forecast sales.

Presentations

Detecting retail fraud with data wrangling and machine learning Session

Matt Derda and Harrison Lynch explain how Consensus leverages the combined power of data wrangling and machine learning to more efficiently identify and reduce retail fraud and how adopting data wrangling technology has helped Trifacta reduce time spent data wrangling from six weeks to one week.

Wei Di is a staff member on LinkedIn’s business analytics data mining team. Wei is passionate about creating smart and scalable solutions that can impact millions of individuals and empower successful business. She has wide interests covering artificial intelligence, machine learning, and computer vision. Previously, Wei worked with eBay Human Language Technology and eBay Research Labs, where she focused on large-scale image understanding and joint learning from visual and text information, and worked at Ancestry.com in the areas of record linkage and search relevance. Wei holds a PhD from Purdue University.

Presentations

Ask Me Anything: Big data and machine learning techniques to drive and grow business Session

Join Burcu Baran and Wei Di to discuss big data in business analytics, machine learning in business analytics, and achieving actionable insights from big data.

Big data analytics and machine learning techniques to drive and grow business Tutorial

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

Maria Diaz is a principal consultant at ASI. She has more than ten years of experience helping organisations solve business problems by applying digital solutions and artificial intelligence. She also has expertise in advising fast-growing organisations on transforming their processes to achieve scale and growth. Before joining ASI, Maria was responsible for the client digital operations at Teradata Marketing Applications, leading the customer success, technical, and project management teams. Prior to that, Maria managed the digital production team at eBay Enterprise.

Presentations

Data science for managers 2-Day Training

Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Seth Dobrin is the vice president and chief data officer for IBM Analytics, where he is responsible for the transformation of the analytics business operations using data and analytics, influencing IBM Analytics offerings to meet the needs of CDOs, and providing his perspective and experiences to clients. Seth has spent his career scaling and combining existing technologies to address previously intractable problems at scale. Previously, he led the data science transformation of a Fortune 500 company, as well as the company’s Agile transformation and their shift to the cloud, and oversaw efforts to leverage the data science transformation to drive new business models to create new revenue streams, along with the optimization of existing processes. He has served as a member of the advisory board for IBM’s Spark Technology Center, is a founding member of the International Society of Chief Data Officers, and has been a prolific panelist at the East and West Chief Data Officer Summits. Seth holds a PhD in genetics from Arizona State University, where he focused on the application of statistical and molecular genetics toward the elucidation of the causes of neuropsychiatric disorders. In these efforts, he was involved in some of the first applications of machine learning to large-scale genetic analysis.

Presentations

Journey to digital (sponsored by IBM) Session

Companies that want to become truly digital must take a journey of three steps: data transformation, data science transformation, and digital transformation. This also requires transforming the business with machine learning to fundamentally change the relationship with customers. Seth Dobrin explains the detailed steps along the way to digital transformation—and the pitfalls.

Harish Doddi is cofounder and CEO of Datatron Technologies. Previously, he held roles at Oracle; Twitter, where he worked on open source technologies, including Apache Cassandra and Apache Hadoop, and built Blobstore, Twitter’s photo storage platform; Snapchat, where he worked on the backend for Snapchat stories; and Lyft, where he worked on the surge pricing model. Harish holds a master’s degree in computer science from Stanford, where he focused on systems and databases, and an undergraduate degree in computer science from the International Institute of Information Technology in Hyderabad.

Presentations

Lessons learned deploying machine learning and deep learning models in production at major tech companies Session

Deploying machine learning and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how the production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory.

Mark Donsky leads product management at Okera, a software company that provides discovery, access control, and governance at scale for today’s modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera. Mark has held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments Tutorial

New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR.

Meet the Expert with Mark Donsky and Steven Ross (Cloudera) Meet the Experts

Curious about GDPR? Stop by and chat with Mark and Steven about best practices for moving toward GDPR compliance, what other organizations are achieving, and how GDPR will impact your organization.

Joe Dumoulin is CTIO at Next IT. Joe has been working as a professional programmer since 1985, focusing on optimization methods relevant to manufacturing and enterprise processes. Currently, most of his work centers around machine learning and natural language processing and managing a research team at customer engagement company Verint. He has helped create some of the earliest and most widely used commercial automated conversational applications and a number of early commercial prototypes in the conversational AI space. In his spare time, Joe teaches CS topics at local universities, builds and programs electronics projects, reads, and plays with his grandkids.

Presentations

Your enterprise AI is only as good as your data. Data Case Studies

AI is transformative for business, but it’s not magic; it’s data. Joe Dumoulin shares how Next IT's global enterprise customers have transformed their businesses with AI solutions and outlines how companies should build AI strategies, utilize data to develop and evolve conversational intelligence and business intents, and ultimately increase ROI.

Ted Dunning is chief application architect at MapR. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects, and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He has contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Better machine learning logistics with the rendezvous architecture Session

Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot, zero-latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime.

Data at scale and speed: Real-world use cases (sponsored by MapR) Session

Getting value from data at large scale and on a variety of time scales is hard. True, it's not as hard as it used to be, but you still don’t win by default. Ted Dunning explains why it takes good design, the right technology, and a pragmatic approach to succeed.

Meet the Expert with Ted Dunning (MapR Technologies) Meet the Experts

If you're interested in machine learning and the logistical aspects of supporting it in production, come talk to Ted. He'll also discuss data platforms, streaming architecture, Kubernetes, containers, and the rendezvous architecture.

Glynn Durham is a senior instructor at Cloudera. Previously, he worked for Oracle, Forté Software, MySQL, and Cloudera, spending five or more years at each.

Presentations

Data science and machine learning with Apache Spark 2-Day Training

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Data science and machine learning with Apache Spark (Day 2) Training Day 2

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Zoran Dzunic is a data scientist on the algorithms and data science (ADS) team within the AI+R Group at Microsoft, where he focuses on machine learning applications for text analytics and natural language processing. He holds a PhD and a master’s degree from MIT, where he focused on Bayesian probabilistic inference, and a bachelor’s degree from the University of Nis in Serbia.

Presentations

Deep learning for domain-specific entity extraction from unstructured text Session

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.

Nick Elprin is the CEO and cofounder of Domino Data Lab, a data science platform that enterprises use to accelerate research and more rapidly integrate predictive models into their business. Nick has over a decade of experience working with quantitative researchers and data scientists, stemming from his time as a senior technologist at Bridgewater Associates, where his team designed and built the firm’s next-generation research platform.

Presentations

Ask Me Anything: Managing data science in the enterprise Session

Join Nick Elprin to discuss the challenges associated with evolving from random acts of data science to data science as a core competency, common pitfalls and best practices for implementing process, hiring people, and deploying diverse technology, designing and running data science organizations, and more.

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Sergey Ermolin is a software solutions architect for deep learning, Spark analytics, and big data technologies at Intel. A Silicon Valley veteran with a passion for machine learning and artificial intelligence, Sergey has been interested in neural networks since 1996, when he used them to predict aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard. Sergey holds an MSEE and a certificate in mining massive datasets from Stanford and BS degrees in both physics and mechanical engineering from California State University, Sacramento.

Presentations

Accelerating deep learning on Apache Spark using BigDL with coarse-grained scheduling Session

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale Session

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach.

Guy Ernest is a principal solutions architect in Amazon AI’s Deep Learning Group, where he works closely with developers, data scientists, and executives to bootstrap and scale machine learning practices at Amazon Web Services customers around the world. Guy has two decades of experience building machine learning systems in computer vision, document analysis, and recommendation engines.

Presentations

Building machine learning systems for scale: Amazon insights and best practices (sponsored by Amazon Web Services) Session

Amazon SageMaker is a platform to build, train, and deploy machine learning models at any scale. Guy Ernest explores the scalable algorithms that SageMaker provides, distributed training with Apache MXNet and TensorFlow, automatic tuning of hyperparameters, and model deployments.

Lenny Evans is a data scientist at Uber focused on the applications of unsupervised methods and deep learning to fraud prevention, specifically developing anomaly detection models to prevent account takeovers and computer vision models for verifying possession of credit cards.

Presentations

Using computer vision to combat stolen credit card fraud Session

Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.

Natalie Evans Harris is the founder of Harris Data Consulting and COO of BrightHive. Natalie has spent more than 16 years driving the strategic use of data to answer some of our nation’s toughest questions and to drive organizational success. She works with a broad network of academic institutions, data science organizations, application developers, and foundations to increase the use of accessible data standards, APIs, and ethical algorithms in scaling data science efforts that directly benefit people receiving social services. Most recently, she brought together Bloomberg, Data for Democracy, and BrightHive to lead the development of a Data Science Code of Ethics through the Community-driven Principles for Ethical Data Sharing (CPEDS) Initiative. Previously, Natalie was a senior policy advisor to the US Chief Technology Officer in the Obama administration, where she founded the Data Cabinet, a federal data science community of practice with over 200 active members across more than 40 federal agencies, co-led a cohort of federal, nonprofit, and for-profit organizations to develop data-driven tools through the Opportunity Project, and established the Open Skills Community through the Workforce Data Initiative. She also led an analytics development center for the National Security Agency (NSA) that served as the foundation for the enterprise data science development program and became a model for other intelligence community agencies. Her achievements resulted in her being the sole member of the NSA chosen as a Brookings legislative fellow. As a member of Senator Cory Booker’s (NJ) legislative team, she focused on cyber and governmental affairs issues, serving as his lead technical and policy advisor on bills such as the Cyber Information Security Protection Act (CISPA). Natalie holds a master’s degree in public administration from George Washington University and both a BS in computer science and a BS in sociology from the University of Maryland Eastern Shore.

Presentations

Data and ethics: Brainstorming Session Session

Join Natalie Evans Harris for a brainstorming session on data and ethics. You'll cover the current Community Principles on Ethical Data Practices (CPEDP) and next steps, existing tools that support ethical data practices, how the community can support the needs of the individual, and whether or not the community needs to be held accountable to regulations (or something more like fiduciary duty).

Defining responsible data practices: A community-driven approach Keynote

Natalie Evans Harris explores the Community Principles on Ethical Data Practices (CPEDP), a community-driven code of ethics for data collection, sharing, and utilization that provides people in the data science community a standard set of easily digestible, recognizable principles for guiding their behaviors.

Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.

Presentations

Powering robotics clouds with Alluxio Session

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.

Li Fan is the senior vice president of engineering at Pinterest, where she leads the company’s technical direction and oversees a team of 400+ engineers building a visual discovery engine. Previously, Li was the senior director of engineering at Google, where she led image search; vice president of engineering at Baidu, where she was responsible for product design and development at China’s largest search engine; and a software developer and engineering manager at Cisco and Ingrian Networks. She holds a master’s degree in computer science from the University of Wisconsin-Madison and a BS in computer science from Fudan University in Shanghai.

Presentations

Merging human and machine learning for everyday solutions Keynote

Li Fan shares insights into how Pinterest improves products based on usage and explains how the company is using AI to predict what’s in an image, what a user wants, and what they’ll want next, answering subjective questions better than machines or humans alone could achieve.

Zhen Fan is a software development engineer at JD.com, where he focuses on machine learning platform development and management.

Presentations

Spark on Kubernetes: A case study from JD.com Session

Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.

Ilan Filonenko is a four-time returning engineering intern at Bloomberg LP, where he has designed and architected distributed systems at both the application and infrastructure level. Previously, Ilan was an engineering consultant and technical lead in various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan’s current research studies algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on optimizing stochastic algorithms such as stochastic gradient descent (SGD).

Presentations

HDFS on Kubernetes: Tech deep dive on locality and security Session

There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support.

Keno Fischer is CTO of Julia Computing, where he leads the company’s efforts in the compiler and developer tools space. Keno has been a core developer of the Julia Language for more than five years. Keno holds an AM in physics and an AB in physics, mathematics, and computer science from Harvard University.

Presentations

Cataloging the visible universe through Bayesian inference at petascale in Julia Session

Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing.

Tom Fisher is CTO at MapR Technologies, where he helps enterprise customers take full advantage of MapR technology and leads initiatives to advance the company’s innovation agenda globally. Previously, Tom was a senior executive in engineering and operations at Oracle, where he supported the company’s top 40 cloud customers globally and served as senior vice president and CIO for global commercial cloud services focusing on improving service delivery through automation and direct action with customers; CIO and vice president of cloud computing at SuccessFactors (now SAP), where he ran cloud operations and emerging technologies in product engineering; CIO of CDMA technologies at Qualcomm; and vice president and acting CTO at eBay.

Presentations

Cloud, multicloud, and the data refinery Session

The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations.

Wayde Fleener is senior manager of decision sciences at General Mills. A seasoned marketing strategy and analytics leader with experience combining disparate information sources to make the right business decision, Wayde has run all aspects of a marketing analytics function, including managing methods, technology, and people. Over his career, he has led diverse teams helping over 100 Fortune 1000 companies across virtually every industry maximize their investments in marketing.

Presentations

Automating business insights through artificial intelligence Data Case Studies

Decision makers are busy. Businesses can hire people to analyze data for them, but most companies are resource constrained and can’t hire a small army to look through all their data. Wayde Fleener explains how General Mills implemented automation to enable decision makers to quickly focus on the metrics that matter and cut through everything else that does not.

Brian Foo is a senior software engineer for Google Cloud working on applied artificial intelligence, where he builds demos for Google Cloud’s strategic customers and creates open source tutorials to improve public understanding of AI. Previously, Brian worked at Uber, where he trained machine learning models and built a large-scale training and inference pipeline for mapping and sensing/perception applications using Hadoop and Spark, and headed the real-time bidding optimization team at Rocket Fuel, where he worked on algorithms that determined millions of ads shown every second across many platforms such as web, mobile, and programmatic TV. Brian holds a BS in EECS from UC Berkeley and a PhD in EE telecommunications from UCLA.

Presentations

Deploying deep learning with TensorFlow Tutorial

TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Michael J. Freedman is the cofounder and CTO of TimescaleDB, an open source database that scales SQL for time series data, and a professor of computer science at Princeton University, where his research focuses on distributed systems, networking, and security. Previously, Michael developed CoralCDN (a decentralized CDN serving millions of daily users) and Ethane (the basis for OpenFlow and software-defined networking) and cofounded Illuminics Systems (acquired by Quova, now part of Neustar). He is a technical advisor to Blockstack. Michael’s honors include the Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), the SIGCOMM Test of Time Award, the Caspar Bowden Award for Privacy Enhancing Technologies, a Sloan Fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, a DARPA Computer Science Study Group membership, and multiple award publications. He holds a PhD in computer science from NYU’s Courant Institute and bachelor’s and master’s degrees from MIT.

Presentations

Meet the Expert with Michael Freedman (TimescaleDB | Princeton) Meet the Experts

Join Mike to learn more about TimescaleDB, a new open source database designed for time series workloads, and ask any questions you may have about time series.

TimescaleDB: Reengineering PostgreSQL as a time series database Session

Michael Freedman offers an overview of TimescaleDB, a new scale-out database designed for time series workloads yet open sourced and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.

Chris Fregly is founder and research engineer at PipelineAI, a San Francisco-based streaming machine learning and artificial intelligence startup. Previously, Chris was a distributed systems engineer at Netflix, a data solutions engineer at Databricks, and a founding member of the IBM Spark Technology Center in San Francisco. Chris is a regular speaker at conferences and meetups throughout the world. He’s also an Apache Spark contributor, a Netflix Open Source committer, founder of the Global Advanced Spark and TensorFlow meetup, author of the upcoming book Advanced Spark, and creator of the O’Reilly video series Deploying and Scaling Distributed TensorFlow in Production.

Presentations

Building ML and AI pipelines with Spark and TensorFlow Session

Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3.

Ellen Friedman is principal technologist for MapR Technologies. Ellen is a committer on the Apache Drill and Apache Mahout projects and coauthor of a number of books on computer science, including Machine Learning Logistics, Streaming Architecture, the Practical Machine Learning series, and Introduction to Apache Flink. Ellen has been an invited speaker at Strata Data conferences, Big Data London, Big Data Paris, Berlin Buzzwords, Nike Tech Talks, the University of Sheffield Methods Institute in the UK, and NoSQL Matters Barcelona. She holds a PhD in biochemistry.

Presentations

DataOps: An Agile methodology for data-driven organizations Session

DataOps—a culture and practice for building data-intensive applications, including machine learning pipelines—expands DevOps philosophy to include data-heavy roles such as data engineering and data science. DataOps is based on cross-functional collaboration resulting in fast time to value and an agile workflow. Ellen Friedman offers an overview of DataOps and explains how to implement it.

Siddha Ganju is a data scientist at Deep Vision, where she works on building deep learning models and software for embedded devices. Siddha is interested in problems that connect natural languages and computer vision using deep learning. Her work ranges from visual question answering to generative adversarial networks to gathering insights from CERN’s petabyte scale data and has been published at top tier conferences like CVPR. She is a frequent speaker at conferences and advises the Data Lab at NASA. Siddha holds a master’s degree in computational data science from Carnegie Mellon University, where she worked on multimodal deep learning-based question answering. When she’s not working, you might catch her hiking.

Presentations

Being smarter than dinosaurs: How NASA uses deep learning for planetary defense Session

Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to improve and automate the identification of meteors above human-level performance using meteor shower images and to recover known meteor shower streams and characterize previously unknown meteor showers using orbital data. The research aims to provide more warning time for long-period comet impacts.

Amanda Gerdes is the senior manager of data engineering at Blizzard Entertainment, where she has helped spark an explosion of big data and driven architectural efforts to harness it into a working power source, fostering the same level of exuberant geekiness about data as she and her comrades have always had toward video games. Mandy believes that life is a true level grind; that sometimes experience points come from major questlines and other times they just come from a bunch of dead dire rats; that being a good leader is shockingly similar to being a good dungeon master; and that FemShep is the only Shep. She loves big data, Atlas stones, and Diablo 3 seasonal leaderboards. Mandy holds master’s degrees in systems engineering and business administration, the title of California’s Strongest Woman 2014, and more Heroes of the Storm master skins than you can shake a Doomhammer at.

Presentations

The data hero’s journey Media and Ad Tech

Session with Amanda Gerdes

Ari Gesher is the founding director of software engineering at Kairos Aerospace, a startup building and operating the next generation of airborne and spaceborne sensors for monitoring oil and gas infrastructure. Ari also serves as consulting architect for Jupiter, a company productizing high-quality datasets that describe the long-term effects of climate change. Previously, he was a very early engineer at Palantir Technologies and later served as Palantir’s engineering ambassador to the tech community at large; before Palantir, he was the maintainer of the SourceForge.net open source archive. Ari is the coauthor of The Architecture of Privacy, which explains how to responsibly hold data about people while preserving their privacy to the greatest extent possible. Ari is a frequent speaker on various topics, including the need for modern, high-leverage engineers to work on substantive problems, human-computer symbiosis as system design aesthetic, the limits of automated decision making, and privacy architectures for a world where everything is recorded.

Presentations

Big data, big problems: Predicting climate change Session

A warming planet needs precise, localized predictions about the effects of climate change to support good long-term and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to predict and package climate predictions as data products.

Debasish Ghosh is principal software engineer at Lightbend. Passionate about technology and open source, he loves functional programming and has been trying to learn math and machine learning. Debasish is an occasional speaker at technology conferences worldwide, including the likes of QCon, Philly ETE, Code Mesh, Scala World, Functional Conf, and GOTO. He is the author of DSLs In Action and Functional & Reactive Domain Modeling. Debasish is a senior member of ACM. He’s also a father, husband, avid reader, and Seinfeld fanboy who loves spending time with his beautiful family.

Presentations

Approximation data structures in streaming data processing Session

Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures.

Noah Gift is consulting CTO and lecturer at UC Davis. He is an adaptable technical leader, entrepreneur, software developer, architect, and engineer with over 20 years’ experience in leadership and engineering (including P&L responsibility). Over the past eight years, Noah has shipped more than 10 new products at multiple companies that generated millions of dollars of revenue and had global scale. Previously, Noah helped build Sqor Sports from scratch, creating the company’s first product and hiring and managing all employees. He has also written production machine learning models in Python and R. Noah is the author of the forthcoming book Pragmatic AI: An Introduction to Cloud-Based Machine Learning as well as a number of articles.

Presentations

What is the relationship between social influence and the NBA? Media and Ad Tech

Noah Gift uses data science and machine learning to explore NBA team valuation and attendance as well as individual player performance. Questions include: What drives the valuation of teams (attendance, the local real estate market, etc.)? Does winning bring more fans to games? Does salary correlate with social media performance?

Zachary Glassman is a data scientist in residence at the Data Incubator. Zachary has a passion for building data tools and teaching others to use Python. He studied physics and mathematics as an undergraduate at Pomona College and holds a master’s degree in atomic physics from the University of Maryland.

Presentations

Hands-on data science with Python 2-Day Training

Zachary Glassman demonstrates how to build intelligent business applications using machine learning, taking you through each step in developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend your knowledge by building two applications from real-world datasets.

Hands-on data science with Python (Day 2) Training Day 2

Zachary Glassman demonstrates how to build intelligent business applications using machine learning, taking you through each step in developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend your knowledge by building two applications from real-world datasets.

Dhruv Goel is a PM for Microsoft Azure Data Services, where he focuses on open source analytics. Previously, Dhruv worked as a software engineer at Amazon. He holds an MBA from Wharton Business School, where he graduated as a Palmer Scholar, and a BS in computer science.

Presentations

Using machine learning to simplify Kafka operations Session

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.

Clare Gollnick is the director of data science at NS1, based in New York City.

Presentations

The limits of inference: What data scientists can learn from the reproducibility crisis in science Session

At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project.

Abe Gong is CEO and cofounder at Superconductive Health. A seasoned entrepreneur, Abe has been leading teams using data and technology to solve problems in healthcare, consumer wellness, and public policy for over a decade. Previously, he was chief data officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe holds a PhD in public policy, political science, and complex systems from the University of Michigan. He speaks and writes regularly on data science, healthcare, and the internet of things.

Presentations

Pipeline testing with Great Expectations Session

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.

Ajey Gore is group CTO at GO-JEK, primarily focused on payments and organization-wide technology and team strategies, where he helps the company deliver a transport, logistics, lifestyle, and payments platform of 20 products. Ajey has 18 years of experience building core technology strategy across diverse domains. His interests include machine learning, networking, and distributed architecture systems. Previously, Ajey founded CodeIgnition (acquired by GO-JEK); served as ThoughtWorks’s head of technology; and was CTO at Hoppr, a Bharti SoftBank-funded startup (acquired by Hike Messenger). An active influencer in the technology community, Ajey is a trustee of the Emerging Technology Trust, a not-for-profit organization, and organizes conferences, including RubyConf, GopherCon, and devopsdays.

Presentations

Inclusivity for the greater good Keynote

Ajey Gore details GO-JEK's evolution from a small bike-hailing startup to a technology-focused unicorn in the areas of transportation, lifestyle, payments, and social enterprise and explains how the company is focusing its attention beyond urban Indonesia to impact more than a million people across the country's rural areas.

Martin Görner works in developer relations at Google, where he focuses on parallel processing and machine learning. Passionate about science, technology, coding, algorithms, and everything in between, Martin’s first role was in the Computer Architecture Group at STMicroelectronics. He also spent 11 years shaping the nascent ebook market, starting at Mobipocket, which later became the software part of the Amazon Kindle and its mobile variants. He holds a degree from Mines ParisTech.

Presentations

Getting started with TensorFlow Tutorial

Martin Görner walks you through training and deploying a machine learning system using the popular open source library TensorFlow. Martin takes you from a conceptual overview all the way to building complex classifiers and explains how you can apply deep learning to complex problems in science and industry.

Felix Gorodishter is a software architect at GoDaddy. Felix is a web developer, technologist, entrepreneur, husband, and daddy.

Presentations

Big data insights equal big money: Stories from the trenches at GoDaddy Session

GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email.

Matthew Granade is a cofounder of Domino Data Lab, which makes a workbench for data scientists to run, scale, share, and deploy analytical models, where he works with companies such as Quantopian, Premise, and Orbital Insights. He also invests in, advises, and serves on the boards of startups in data, data analysis, finance, and financial tech. Previously, Matthew was co-head of research at Bridgewater Associates, where he built and managed teams that ensured Bridgewater’s understanding of the global economy, created new systems for generating alpha, produced daily trading signals, and published Bridgewater’s market commentary, and an engagement manager at McKinsey & Company. He holds an undergraduate degree from Harvard University, where he was president of the Harvard Crimson, the university’s daily newspaper, and an MBA with highest honors from Harvard Business School.

Presentations

Managing data science at scale Session

Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale.

Kyle Grove is the chief data scientist for Teradata’s Wells Fargo relationship, where he employs his dual background in natural language processing and cognitive science to architect data science solutions that optimize banking functions in risk, compliance, service, and marketing. The productionalized solutions utilize machine learning at scale to predict and nudge human behavior to ends favorable to the bank and its customers.

Presentations

Deep credit risk ranking with LSTM Session

Kyle Grove explains how Teradata and some of the world’s largest financial institutions are innovating credit risk ranking with deep learning techniques and AnalyticOps. With the AnalyticOps framework, these organizations have built models with increased accuracy to drive more profitable lending decisions while being explainable to regulators.

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Dogfooding data at Lyft Session

Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.

Goodman Xiaoyuan Gu is head of machine learning architecture at Boston-based Cogito, where he leads operations of a large-scale, real-time augmented intelligence platform. Previously, he headed marketing data engineering at Atlassian and was vice president of technology at CPXi, director of engineering at Dell, and general manager at Amazon, where he built marketing, analytics, and machine learning applications. He has served on technical program committees of two IEEE flagship conferences and is the author of over a dozen academic publications in high-profile IEEE and ACM journals and conferences. Goodman holds a degree in engineering and management from MIT.

Presentations

Not your parents' machine learning: How to ship an XGBoost churn prediction app in under four weeks Session

Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker.

Evan Guarnaccia is a solutions architect in the Internet of Things Division at SAS, where he specializes in real-time analytics and internet of things (IoT) applications using SAS Event Stream Processing and helps customers understand the capabilities of SAS real-time solutions and how they can derive business value with streaming analytics. He also provides internal enablement and training regarding how to position and develop Event Stream Processing projects. Evan holds a PhD in experimental particle physics from Virginia Tech. His research involved collider physics and neutrino detection experiments, and his thesis was on the modeling and measurement of the cosmic muon flux at underground sites.

Presentations

Bringing AI into the IoT (sponsored by SAS) Session

As the internet of things grows, there is an increasing need for sophisticated but lightweight analytics at the edge. Evan Guarnaccia walks you through a multiphase analytics approach to IoT data, analyzing data at rest to discover patterns of interest and develop analytical models that can be easily deployed into a streaming analytics engine out at the edge, in the fog, or in the cloud.

Sudipto Guha is principal scientist at Amazon Web Services, where he studies the design and implementation of a wide range of computational systems, from resource-constrained devices, such as sensors, to massively parallel and distributed systems. Using an algorithmic framework, Sudipto seeks to design systems that are correct, efficient, and optimized despite their bidirectional asymptotic scale and seeming lack of similarity to human information processes. His recent work focuses on clustering and location theory, statistics and learning theory, database query optimization and mining, approximation algorithms for stochastic control, communication complexity, and data stream algorithms.

Presentations

Continuous machine learning over streaming data Session

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.

Debraj GuhaThakurta is a senior data scientist lead for AI and research, the Cloud Data Platform, algorithms, and data science at Microsoft, where he focuses on developing the team data science process and the use of different Microsoft data platforms and toolkits (Spark, SQL Server, ADL, Hadoop, DL toolkits, etc.) for creating scalable and operationalized analytical processes. He has many years of experience using data science and machine learning applications, particularly in biomedical and forecasting domains, and has published more than 25 peer-reviewed papers, book chapters, and patents. Debraj holds a PhD in chemistry and biophysics.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Muhammad Gulzar is a PhD candidate in the Computer Science Department at the University of California, Los Angeles, where he is advised by Miryung Kim. Muhammad’s research interests lie at the intersection of software engineering and big data systems—specifically, in supporting interactive debugging in big data processing frameworks and providing efficient ways to perform automated fault localization in big data applications. He holds an undergraduate degree in computer science from Lahore University of Management Sciences (LUMS) SBASSE in Pakistan, where he was mentored by Fareed Zaffar.

Presentations

Who are we? The largest-scale study of professional data scientists Session

Even though we know that there are more data scientists in the workforce today, neither what those data scientists actually do nor what we even mean by data scientists has been studied quantitatively. Miryung Kim and Muhammad Gulzar share the results of a large-scale survey of 793 professional data scientists and detail several trends about data scientists in the software engineering context.

Alexandra Gunderson is a data scientist at Arundo Analytics. Her background is in mechanical engineering and applied numerical methods.

Presentations

Machine learning to tackle industrial data fusion Session

Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources.

Sijie Guo is the cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for the Messaging Group at Twitter, where he cocreated Apache DistributedLog, and worked on push notification infrastructure at Yahoo. He is the PMC chair of Apache BookKeeper.

Presentations

Modern real-time streaming architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Stream storage with Apache BookKeeper Session

Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage.

Somit Gupta is a senior data scientist with Microsoft’s analysis and experimentation team. Recently, he helped the MSN and Edge browser content teams scale their experimentation and analyze those experiments with better OEC and diagnostic metrics. Somit holds a master’s degree in computer science from the University of Waterloo, Canada.

Presentations

A/B testing at scale: Accelerating software innovation Tutorial

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.

Jordan Hambleton is a consulting manager and senior architect at Cloudera, where he partners with customers to build and manage scalable enterprise products on the Hadoop stack. Previously, Jordan was a member of technical staff at NetApp, where he designed and implemented the NRT operational data store that continually manages automated support for all of its customers’ production systems.

Presentations

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark Session

When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.

Chris Harland is director of data engineering at augmented writing platform Textio. Over his career, Chris has worked in a wide variety of fields spanning elementary science education, cutting-edge biophysical research, and recommendation and personalization engines. Previously, he was a data scientist and machine learning engineer at Versive (formerly Context Relevant) and a data scientist at Microsoft working on problems in Bing search, Xbox, Windows, and MSN. Chris holds a PhD in physics from the University of Oregon. Every year he thinks, “This is the year I’m going to stop thinking SQL is the best query language ever,” and every year he’s wrong.

Presentations

Crafting data products for the augmented writing experience Session

The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product.

Patrick Harrison started and leads the data science team at S&P Global Market Intelligence (S&P MI), a business and financial intelligence firm and data provider. The team employs a wide variety of data science tools and techniques, including machine learning, natural language processing, recommender systems, and graph analytics, among others. Patrick is the coauthor of the forthcoming book Deep Learning with Text from O’Reilly Media, along with Matthew Honnibal, creator of spaCy, the industrial-strength natural language processing software library, and is a founding organizer of a machine learning conference in Charlottesville, Virginia. He is actively involved in building both regional and global data science communities. Patrick holds a BA in economics and an MS in systems engineering, both from the University of Virginia. His graduate research focused on complex systems and agent-based modeling.

Presentations

Word embeddings under the hood: How neural networks learn from language Session

Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more.

Frances Haugen is a data product manager at Pinterest focusing on ranking content in the home feed and related pins and the challenges of driving immediate user engagement without harming the long-term health of the Pinterest content ecosystem. Previously, Frances worked at Google, where she founded the Google+ search team, built the first non-quality-based search experience at Google, and cofounded the Google Boston search team. She loves user-facing big data applications and finding ways to make mountains of information useful and delightful to the user. Frances was a member of the founding class of Olin College and holds a master’s degree from Harvard.

Presentations

Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science Session

Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.

Violeta Hennessey is a senior manager of sales forecasting at Warner Bros., where she has developed various statistical predictive models to address the dynamic and rapidly changing entertainment sector. She is a coauthor of several publications on Bayesian methodology. As a senior manager at Warner Bros., she strives to make data science vital in identifying the ideal market mix for home video entertainment in order to maximize revenue.

Presentations

A Bayesian choice model for home video entertainment digital demand Media and Ad Tech

Home video electronic sell-through (EST) demand growth is slow and shows signs of a tipping point in 2019. Point-of-sale time series data shows that consumers are sensitive to high-definition (HD) EST price, and unit sales spike when price drops. Violeta Hennessey shares a proactive approach to enhance EST growth through price.

Who has better taste, machines or humans? Media and Ad Tech

Algorithms decide what we see, what we listen to, what news we consume, and myriad other decisions each day. But while they can make many things more efficient, can they outperform humans in areas where the "right" outcome can't be clearly defined? In this Oxford-style debate, two teams will face off, arguing whether or not machines have better taste than humans.

Or Herman-Saffar is a data scientist at Dell. She holds an MSc in biomedical engineering, where her research focused on breast cancer detection using breath signals and machine learning algorithms, and a BS in biomedical engineering specializing in signal processing from Ben-Gurion University, Israel.

Presentations

AI-powered crime prediction Session

What if we could predict when and where crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly available dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes, find interesting trends, and make predictions for the future.

Szehon Ho is a staff software engineer on the analytics data storage team at Criteo, where he works on Criteo’s Hive platform. Previously, he was a software engineer on the Hive team at Cloudera. He was a committer and PMC member in the Apache Hive open source community, working on features like Hive on Spark and Hive monitoring and metrics, among others.

Presentations

Hive as a service Session

Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.

Bob Horton is a senior data scientist on the deep partner engagement team within Microsoft’s AI and Research Group, where he helps independent software vendors build and deploy machine learning solutions for their customers. Previously, he worked on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento. Bob currently holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Shant Hovsepian is a cofounder and CTO of Arcadia Data, where he is responsible for the company’s long-term innovation and technical direction. Previously, Shant was an early member of the engineering team at Teradata, which he joined through the acquisition of Aster Data. Shant interned at Google, where he worked on optimizing the AdWords database, and was a graduate student in computer science at UCLA. He is the coauthor of publications in the areas of modular database design and high-performance storage systems.

Presentations

Executive Briefing: BI on big data Session

There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian outline the trade-offs between a number of architectures that provide self-service access to data and discuss the pros and cons of architectures, deployment strategies, and examples of BI on big data.

Wayne Hu is a partner at SignalFire, a hybrid venture firm and technology company that has built real-time AI platforms and predictive tools in areas such as hiring and competitive intelligence. He helped found the investment team in 2015 and currently leads seed-stage investments with a focus on emerging technology platform shifts and historically innovation-sparse sectors. Wayne’s investments to date include data-driven companies such as Juvo, ScopeAR, and Mapper.AI. Previously, Wayne led global strategy for YouTube ads monetization, helping shape go-to-market for Google’s growing multibillion-dollar video business. Wayne received an MBA from Harvard Business School while also working part time at Kleiner Perkins Caufield & Byers, where he used data to identify potential breakout companies. Prior to that, he earned a BA in mathematical economics from Princeton. Wayne was taught to code at a young age by his parents from rural Taiwan, who immigrated to the US to become engineers. Today, he lives out his childhood dreams by partnering with early-stage startups using technology to break down barriers to a more connected, open, and educated society.

Presentations

Make data work: A VC panel discussion on prospectives and trends Session

To anticipate who will succeed and to invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, top-tier VCs look over the horizon to consider the big trends in how data is being put to work in startups and share what they think the field will look like in a few years (or more).

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.

Presentations

Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber Session

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.

Unified and elastic batch and stream processing with Pravega and Apache Flink Session

Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack offers an unprecedented way of handling “everything as a stream”: it includes unbounded streaming storage and a unified batch and streaming abstraction and dynamically accommodates workload variations in a novel way.

Simon Hughes is the chief data scientist at technology professional recruiting site Dice.com, where he develops multiple recommender engines for matching job seekers with jobs and optimizes the accuracy and relevancy of Dice.com’s job and candidates search. More recently, Simon has been instrumental in building the machine intelligence behind the Career Explorer portion of Dice’s website, which allows users to gauge their market value and explore potential career paths. Simon is a PhD candidate in machine learning and natural language processing at DePaul University, where he is researching machine learning approaches for determining causal relations in student essays, with the view to building more intelligent essay-grading software.

Presentations

Building career advisory tools for the tech sector using machine learning Session

Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production.

Kevin Huiskes is the director of marketing in Intel’s Data Center Group. In his 16 years at Intel, Kevin has held a variety of senior business and marketing positions throughout the company, including two years as chief of staff to the executive vice president of Intel’s Data Center Group. His experience includes managing the Intel Data Center Group central marketing organization, managing the Intel Xeon Scalable Processors product line, business development, and a variety of other product management roles. Prior to Intel, Kevin served as a legislative assistant and committee aide to a member of Congress in the US House of Representatives. He holds an MBA from Georgetown University and a BA in political science from Wheaton College.

Presentations

Accelerating analytics and AI from the edge to the cloud (sponsored by Intel) Session

Advanced analytics and AI workloads require a scalable and optimized architecture, from hardware and storage to software and applications. Kevin Huiskes and Radhika Rangarajan share best practices for accelerating analytics and AI and explain how businesses globally are leveraging Intel’s technology portfolio, along with optimized frameworks and libraries, to build AI workloads at scale.

Alysa Z. Hutnik is a partner at Kelley Drye & Warren LLP in Washington, DC, where she delivers comprehensive expertise in all areas of privacy, data security, and advertising law. Alysa’s experience ranges from counseling to defending clients in FTC and state attorneys general investigations, consumer class actions, and commercial disputes. Much of her practice is focused on the digital and mobile space in particular, including the cloud, mobile payments, calling and texting practices, and big data-related services. Ranked as a leading practitioner in the privacy and data security area by Chambers USA, Chambers Global, and Law360, Alysa has received accolades for the dedicated and responsive service she provides to clients. The US Legal 500 notes that she provides “excellent, fast, efficient advice” regarding data privacy matters. In 2013, she was one of just three attorneys under 40 practicing in the area of privacy and consumer protection law to be recognized as a rising star by Law360.

Presentations

Executive Briefing: Legal best practices for making data work Session

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”

Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Vlad A. Ionescu is the founder and chief architect of ShiftLeft. Vlad is the creator of the industry’s first open source lambda framework. Previously, he worked at Google and VMware as an infrastructure engineer. Vlad is the coauthor of RabbitMQ’s Erlang client.

Presentations

Code Property Graph: A modern, queryable data storage for source code Session

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing them to be rapidly accessed.

Blake Irvine is manager of data engineering and analytics at Netflix, where he leads a team of engineers who make data products for the business development, finance, product management, and engineering teams.

Presentations

Cohort analysis at scale Media and Ad Tech

Blake Irvine explains why cohorts matter and shares a big data solution that enables business users to self-serve partner insights.

Kinnary Jangla is a senior software engineer on the homefeed team at Pinterest, where she works on the machine learning infrastructure team as a backend engineer. Kinnary has worked in the industry for 10+ years. Previously, she worked on maps and international growth at Uber and on Bing search at Microsoft. Kinnary holds an MS in computer science from the University of Illinois and a BE from the University of Mumbai.

Presentations

Accelerating development velocity of production ML systems with Docker Session

Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.

Ivan Jibaja is a software engineer at Pure Storage, where he leads the team that built a big data analytics pipeline for streaming telemetry data from Pure Storage’s testing infrastructure to classify, prioritize, and understand the root causes of bugs in the software development cycle. Ivan was a part of the core development team that built FlashBlade from the ground up. He holds a PhD in computer science from the University of Texas at Austin with a concentration in compilers and programming languages.

Presentations

When tests cry wolf (sponsored by Pure Storage) Session

Pure Storage redefined QA testing. Using open source technologies like Spark and Kafka, the company deployed a streaming big data analytics pipeline that processes over 70 billion events per day to prioritize, classify, deduplicate, and understand test failures. Ivan Jibaja discusses use cases for big data analytics technologies, the underlying elastic infrastructure, and lessons learned.

Flavio Junqueira is senior director of software engineering at Dell EMC, where he leads the Pravega team. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, Flavio held an engineering position with Confluent and research positions with Yahoo Research and Microsoft Research. He is an active contributor to Apache projects, including Apache ZooKeeper (as PMC and committer), Apache BookKeeper (as PMC and committer), and Apache Kafka. Flavio coauthored the O’Reilly ZooKeeper book. He holds a PhD in computer science from the University of California, San Diego.

Presentations

Meet the Expert with Flavio Junqueira (Dell EMC) Meet the Experts

Pravega and Apache Flink make a powerful combination for implementing stream processing pipelines. Come talk with Flavio about modern techniques for implementing such pipelines with the stream storage features of Pravega and the data processing features of Apache Flink.

Unified and elastic batch and stream processing with Pravega and Apache Flink Session

Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack offers an unprecedented way of handling “everything as a stream”: it includes unbounded streaming storage and a unified batch and streaming abstraction and dynamically accommodates workload variations in a novel way.

Tomer Kaftan is a second-year PhD student at the University of Washington, working with Magdalena Balazinska and Alvin Cheung. His research interests include machine learning systems, distributed systems, and query optimization. Previously, Tomer was a staff engineer in UC Berkeley’s AMPLab, working on systems for large-scale machine learning. He holds a degree in EECS from UC Berkeley. He is a recipient of an NSF Graduate Research Fellowship.

Presentations

Cuttlefish: Lightweight primitives for online tuning Session

Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.

Joseph Kambourakis is a data science instructor at Databricks. Joseph has more than 10 years of experience teaching, over five of them with data science and analytics. Previously, Joseph was an instructor at Cloudera and a technical sales engineer at IBM. He has taught in over a dozen countries around the world and been featured on Japanese television and in Saudi newspapers. He is a rabid Arsenal FC supporter and competitive Magic: The Gathering player. Joseph holds a BS in electrical and computer engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. He lives with his wife and daughter in Needham, MA.

Presentations

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML Tutorial

Join Joseph Kambourakis for an introduction to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Playing well together: Big data beyond the JVM with Spark and friends Session

Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).

Brian Karfunkel is a data scientist at Pinterest. Previously, he was senior data analyst at the NYU Furman Center, where he worked on housing and urban policy issues, and a research fellow at Stanford Law School, where he helped research the effects of workplace safety and health policy.

Presentations

Trapped by the present: Estimating long-term impact from A/B experiments Session

When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users.

Aneesh Karve is the CTO of Quilt Data, a Y Combinator company advancing an open source standard for versioned data. Previously, Aneesh was a product manager, lead designer, and software engineer at companies including Microsoft, NVIDIA, and Matterport and the general manager and founding member of AdJitsu, the first real-time 3D advertising platform for iOS (acquired by Amobee in 2012). He holds degrees in chemistry, mathematics, and computer science. Aneesh’s research background spans proteomics, machine learning, and algebraic number theory.

Presentations

Who has better taste, machines or humans? Media and Ad Tech

Algorithms decide what we see, what we listen to, what news we consume, and myriad other decisions each day. But while they can make many things more efficient, can they outperform humans in areas where the "right" outcome can't be clearly defined? In this Oxford-style debate, two teams will face off, arguing whether or not machines have better taste than humans.

Mubashir Kazia is a principal solutions architect at Cloudera and an SME in Apache Hadoop security in Cloudera’s Professional Services practice, where he helps customers secure their Hadoop clusters and comply with internal security policies. He also helps new customers transition to the Hadoop platform and implement their first few use cases and trains and mentors peers in Hadoop and Hadoop security. Mubashir has worked with customers from all verticals, including banking, manufacturing, healthcare, telecom, retail, and gaming. Previously, he worked on developing solutions for leading investment banking firms.

Presentations

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments Tutorial

New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR.

Until recently, Arun Kejariwal was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install and click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns. In addition, his team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Leveraging live data to realize the smart cities vision Session

One of the key application domains leveraging live data is smart cities, but success depends on the availability of generic platforms that support high throughput and ultralow latency. Arun Kejariwal and Francois Orsini offer an overview of Satori's live data platform and walk you through a country-scale case study of its implementation.

Modern real-time streaming architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Sagar Kewalramani is a strategic solution architect and data scientist at Cloudera, where he helps customers install, build, secure, optimize, and tune their Hadoop clusters. He also helps new customers transition to the Hadoop platform and implement their initial use cases. Sagar has worked with customers from all verticals, including banking, manufacturing, healthcare, and retail. He has wide experience in building business use cases, high-volume real-time data ingestion, transformation and movement, and data lineage and discovery. He has led the discovery and development of big data and machine learning applications to accelerate digital business and simplify data management and analytics. He has spoken at multiple Hadoop and big data conferences, including O’Reilly Strata. Previously, he was a data architect at Meijer Inc., where he primarily focused on architecture design and administration roles for ETL tools and databases, including Teradata.

Presentations

Architecting an open source enterprise data lake Session

With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components.

Kimoon Kim is a software engineer at Pepperdata. Previously, he worked for the Google Search and Yahoo Search teams for many years. Kimoon has hands-on experience with large distributed systems processing massive datasets.

Presentations

HDFS on Kubernetes: Tech deep dive on locality and security Session

There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support.

Miryung Kim is an associate professor in the Department of Computer Science at UCLA as well as the cofounder of MK.Collective. Miryung builds automated software tools, such as debuggers, testing tools, refactoring engines, and code analytics, for improving data scientist productivity and efficiency in developing big data analytics. She also conducts empirical studies of professional software engineers and data scientists in the wild and uses the resulting insights to design novel software engineering tools. Previously, she was an assistant professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin and a visiting researcher at the Research in Software Engineering (RiSE) group at Microsoft Research. Miryung’s honors include an NSF CAREER award, a Microsoft Software Engineering Innovation Foundation Award, an IBM Jazz Innovation Award, a Google Faculty Research Award, an Okawa Foundation Research Grant Award, and an ACM SIGSOFT Distinguished Paper Award. She also received the Korean Ministry of Education, Science, and Technology Award, the highest honor given to an undergraduate student in Korea. Miryung holds a BS in computer science from the Korea Advanced Institute of Science and Technology and an MS and PhD in computer science and engineering from the University of Washington.

Presentations

Meet the Expert with Miryung Kim (UCLA) Meet the Experts

Are you building a data team? Meet Miryung to discuss the different subcluster characteristics of data scientists; the skill sets, best practices, and tool usage to look for; and debugging, testing, and workflow tools for data scientists.

Who are we? The largest-scale study of professional data scientists Session

Even though we know that there are more data scientists in the workforce today, neither what those data scientists actually do nor what we even mean by data scientists has been studied quantitatively. Miryung Kim and Muhammad Gulzar share the results of a large-scale survey of 793 professional data scientists and detail several trends about data scientists in the software engineering context.

Eugene Kirpichov is a staff software engineer on the Cloud Dataflow team at Google, where he works on the Apache Beam programming model and APIs. Previously, Eugene worked on Cloud Dataflow’s autoscaling and straggler elimination techniques. He is interested in programming language theory, data visualization, and machine learning.

Presentations

Radically modular data ingestion APIs in Apache Beam Session

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.

Ronny Kohavi is a Microsoft distinguished engineer and the general manager for the analysis and experimentation team within Microsoft’s Artificial Intelligence and Research Group. Previously, he was partner architect at Bing and founder of the experimentation platform team. Prior to Microsoft, he was the director of data mining and personalization at Amazon; the vice president of business intelligence at Blue Martini Software (acquired by Red Prairie); and manager of the MineSet project, Silicon Graphics’ award-winning product for data mining and visualization. Ronny was the general chair for KDD 2004, cochair of KDD 99’s industrial track with Jim Gray, and cochair of the KDD Cup 2000 with Carla Brodley and has been an invited or keynote speaker at a number of conferences around the world. His papers have over 34,000 citations; three of them are in the top 1,000 most-cited papers in computer science. In 2016, he was named the fifth-most-influential scholar in AI and the twenty-sixth-most-influential scholar in machine learning. Ronny holds a PhD in machine learning from Stanford University, where he led the MLC++ project (the machine learning library in C++ used in MineSet and at Blue Martini Software), and a BA from the Technion, Israel.

Presentations

A/B testing at scale: Accelerating software innovation Tutorial

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.

Vijay Kotu is vice president of analytics at Oath, a Verizon Company, where he leads the implementation of large-scale data and analytics systems to support the company’s online business. He is the coauthor of Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner.

Presentations

The four elements of modern analytics (sponsored by MicroStrategy) Session

Vijay Kotu details how Oath is using MicroStrategy to combine elements of data science, enterprise mobility, information design, and data lakes in its transformation into an intelligent enterprise.

Evan Kriminger is a senior associate of data science at ZestFinance, where his research interests include explainability and building efficient tools for training deep neural networks. He holds a PhD from the Computational NeuroEngineering Laboratory at the University of Florida, where he completed a dissertation on active learning and constrained clustering. Prior to ZestFinance, he worked at Leap Motion, conducting machine learning research for hand tracking.

Presentations

Explaining machine learning models Session

What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements.

Chi-Yi Kuan is director of business analytics at LinkedIn. He has over 15 years of extensive experience in applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Presentations

Big data analytics and machine learning techniques to drive and grow business Tutorial

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

Sanjeev Kulkarni is the cofounder of Streamlio, a company focused on building a next-generation real-time stack. Previously, he was the technical lead for real-time analytics at Twitter, where he cocreated Twitter Heron; worked at Locomatix handling the company’s engineering stack; and led several initiatives for the AdSense team at Google. Sanjeev holds an MS in computer science from the University of Wisconsin-Madison.

Presentations

Effectively once, exactly once, and more in Heron Session

Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and demonstrate how your applications will benefit from using them.

Modern real-time streaming architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Santosh Kulkarni is a product leader at Kaiser Permanente, where he is responsible for driving the development of its intelligence platform systems. Santosh has a deep passion for healthcare and technology and has been an active member in many of the healthcare industry’s forums, which have shaped the healthcare industry in recent years. An experienced healthcare thought leader, Santosh has advised and supported some of the top global healthcare players in defining and building next-generation healthcare products and solutions. Previously, he spent more than a decade providing strategic, product, and digital transformation consulting and services to healthcare organizations, with a focus on digital health and consumer and population health management, and was part of the initial architecture team that built Siemens’s flagship EHR platform, Soarian. Santosh holds a master’s degree in business administration and a bachelor’s degree in computer science and engineering.

Presentations

Spark NLP in action: Improving patient flow forecasting at Kaiser Permanente Session

David Talby and Santosh Kulkarni explain how Kaiser Permanente uses the open source NLP library for Apache Spark to tackle one of the most common challenges with applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline.

Abhishek Kumar is a senior manager of data science in Sapient’s Bangalore office, where he looks after scaling up the data science practice by applying machine learning and deep learning techniques to domains such as retail, ecommerce, marketing, and operations. Abhishek is an experienced data science professional and technical team lead who specializes in building and managing data products from conceptualization to deployment and is interested in solving challenging machine learning problems. Previously, he worked in the R&D center for the largest power-generation company in India on various machine learning projects involving predictive modeling, forecasting, optimization, and anomaly detection and led the center’s data science team in the development and deployment of data science-related projects in several thermal and solar power plant sites. Abhishek is a technical writer and blogger as well as a Pluralsight author and has created several data science courses. He is also a regular speaker at various national and international conferences (including the Strata conference) and universities. Abhishek holds a master’s degree in information and data science from the University of California, Berkeley.

Presentations

Ask Me Anything: Deep learning-based search and recommendation systems using TensorFlow Session

Join Vijay Srinivas Agneeswaran and Abhishek Kumar to discuss recommender systems—particularly deep learning-based recommender systems in TensorFlow—or ask any other questions you have about deep learning.

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.

Philip Lavori holds a PhD in mathematics from Cornell (1974) and has served on the faculty at MIT, Harvard, Brown, and Stanford. He specializes in adaptive designs for clinical trials, the use of observational data to estimate causal effects of treatment, and trials embedded in clinical practice.

Presentations

Distributed clinical models: Inference without sharing patient data Session

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.

Francesca Lazzeri is an AI and machine learning scientist on the cloud developer advocacy team at Microsoft. Francesca has multiple years of experience as a data scientist and data-driven business strategy expert; she is passionate about innovations in big data technologies and the applications of machine learning-based solutions to real-world problems. Her work on these issues covers a wide range of industries, including energy, oil and gas, retail, aerospace, healthcare, and professional services. Previously, she was a research fellow in business economics at Harvard Business School, where she performed statistical and econometric analysis within the Technology and Operations Management Unit and worked on multiple patent data-driven projects to investigate and measure the impact of external knowledge networks on companies’ competitiveness and innovation. Francesca is a mentor for PhD and postdoc students at the Massachusetts Institute of Technology and enjoys speaking at academic and industry conferences to share her knowledge and passion for AI, machine learning, and coding. Francesca holds a PhD in innovation management.

Presentations

Operationalize deep learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios Session

Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.

Mike Lee Williams is a research engineer at Cloudera Fast Forward Labs, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Cloudera’s customers understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Interpretable machine learning products Session

Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Michael Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Michael offers an overview of the open source, model-agnostic tool LIME.

Lisha is a principal at Amplify Partners. She invests in technical founders solving ambitious problems. From compute substrate to the creative process, medicine to manufacturing, she is excited to be investing at a time when machine intelligence and data-driven methods have such incredible potential for impact. Investments she has been involved with include Embodied Intelligence and Primer. Lisha completed her PhD at UC Berkeley, focusing on deep learning and probability. While at Berkeley, she also did statistical consulting, advising on methods and analysis for experimentation and interpretation, and interned as a data scientist at Pinterest and Stitch Fix. She was the lecturer of discrete mathematics, as well as the graduate instructor for probability and computer science theory. Other things she has dabbled in include acting (unionized in Canada) and dance (she was an artist with a modern dance company). In a fun turn of events one summer while visiting Paris as a research mathematician, Lisha found herself the subject of the short films “A Portrait of a Mathematician lady” and “Sizes of Infinity” by filmmaker Olivier Peyon, which tie together some of these eclectic interests.

Presentations

Make data work: A VC panel discussion on prospectives and trends Session

To anticipate who will succeed and to invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, top-tier VCs look over the horizon to consider the big trends in how data is being put to work in startups and share what they think the field will look like in a few years (or more).

Michael Li is head of analytics at LinkedIn, where he helps define what big data means for LinkedIn’s business and how it can drive business value through the EOI analytics framework. Michael is passionate about solving complicated business problems with a combination of superb analytical skills and sharp business instincts. His specialties include building and leading high-performance teams to quickly meet the needs of fast-paced, growing companies. Michael has a number of years’ experience in big data innovation, business analytics, business intelligence, predictive analytics, fraud detection, analytics, operations, and statistical modeling across financial, ecommerce, and social networks.

Presentations

Big data analytics and machine learning techniques to drive and grow business Tutorial

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

Wei Lin is a senior manager at Dell EMC and chief data scientist for the company’s Big Data Practice, where he is responsible for planning the company’s data science strategy and leading data science services delivery as well as leading data scientist project delivery and the hiring, training, and certification of new data scientists. He hosts Dell EMC’s data science mentorship program, which shares data scientists’ engagement findings, industry experience, techniques, and trends. His successes include developing Dell EMC’s data science field consulting methodology, Descriptive, Exploration, Predictive and Prescriptive (DEPP), which provides a practical analytics roadmap and approaches for an organization’s business initiatives and data and analytic requirements. Wei has over 20 years of experience in predictive analytics, including analytical modeling, architecture design, data warehousing, reporting, and marketing. Previously, he was a principal consultant at IBM, PwC, and Coopers & Lybrand. He has authored over 100 papers, and his work has been published or reported on in professional journals as well as Businessweek and Forbes. Wei holds both an MA and a PhD in electrical engineering, specializing in artificial intelligence, from the State University of New York at Binghamton and a BS in electrical engineering from National Taipei Institute of Technology, Taiwan.

Presentations

Bladder cancer diagnosis using deep learning Session

Image recognition-based classification of diseases can minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.

Ryan Lippert works in product marketing at Google, where he is responsible for developing and communicating Google’s vision for big data and analytics. Previously, Ryan served in a variety of roles at Cisco Systems and Cloudera. He holds an economics degree from the University of Guelph and an MBA from Stanford.

Presentations

Building the bridge from big data to machine learning and artificial intelligence (sponsored by Google Cloud) Session

If your company isn't good at analytics, it's not ready for AI. Ryan Lippert explains how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value.

Billy (Yiming) Liu is the vice president of partner and ecosystem at Kyligence, which provides a leading intelligent data platform powered by Apache Kylin to simplify big data analytics from on-premises to the cloud. Billy is an Apache Kylin PMC member and committer and is passionate about growing the community, building the ecosystem, and extending adoption globally. Previously, he was the senior R&D manager at Ctrip.com, focused on the extendable architecture and key components for the 10x program, including messaging middleware and microservice architecture, and platform architect at the Cisco Research and Development Center, where he worked on distributed data management and streaming analytics. He holds a PhD in computer software and theory from Fudan University.

Presentations

Speed up mission-critical analytics in the cloud (sponsored by Kyligence) Session

As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data.

Shaoshan Liu is the cofounder and president of PerceptIn, a company working on developing a next-generation robotics platform. Previously, he worked on autonomous driving and deep learning infrastructure at Baidu USA. Shaoshan holds a PhD in computer engineering from the University of California, Irvine.

Presentations

Powering robotics clouds with Alluxio Session

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience and enjoys intelligent design and engaging storytelling. He is passionate about data, music, and nature.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Privacy in the age of machine learning Keynote

Ben Lorica shares emerging security best practices for business intelligence, machine learning, and mobile computing products and explores new tools, methods, and products that can help ease the way for companies interested in deploying secure and privacy-preserving analytics.

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Edwina Lu is a software engineer on LinkedIn’s Hadoop infrastructure development team, currently focused on supporting Spark on the company’s clusters. Previously, she worked at Oracle on database replication.

Presentations

Metrics-driven tuning of Apache Spark at scale Session

Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Nancy Lublin does not sleep very much. Nancy is the founder and CEO of Crisis Text Line, which has processed over 50 million messages in four years and is one of the first “big data for good” orgs. Previously, she was CEO of DoSomething.org for 12 years, taking it from bankruptcy to the largest organization for teens and social change in the world. Her first venture was Dress for Success, which helps women transition from welfare to work in almost 150 cities in 22 countries. She founded this organization with a $5,000 inheritance from her great-grandfather. Before leading three of the most popular charity brands in America, Nancy was a bookworm. She is the author of four books and is a board member of McGraw-Hill Education. She studied politics at Brown University, political theory at Oxford University as a Marshall Scholar, and has a law degree from New York University. Nancy was a judge for 2017’s Miss USA Pageant (an honor she thought was hilarious) and is a Young Global Leader of the World Economic Forum. (She has attended Davos multiple times.) She was named Schwab Social Entrepreneur of the Year in 2014 and has been recognized as one of the NonProfit Times’s Power and Influence Top 50 three times. She is married to Jason Diaz and has two children who have never tasted Chicken McNuggets.

Presentations

Crisis Text Line data usage and insights Keynote

Nancy Lublin shares insights from Crisis Text Line.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Over his career, he has been responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He is also a cofounder of and frequent speaker at several Chicago user groups.

Presentations

Ask Me Anything: Streaming architectures and applications (Kafka, Spark, Akka, and microservices) Session

Join Dean Wampler and Boris Lublinsky to discuss all things streaming, from architecture and implementation to streaming engines and frameworks. Be sure to bring your questions about techniques for serving machine learning models in production, traditional big data systems, or software architecture in general.

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams Tutorial

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Daniel Lurie leads the product analytics and science team at Pinterest, a group that mixes deep data skills with strategic thinking to help Pinterest’s product team grow the company’s user base, develop new features, and increase engagement. The team’s work ranges from understanding product performance via A/B experiment analysis to identifying and sizing market opportunities to defining and tracking success through metrics. Previously, Dan led analytics for a sales-focused business line at LinkedIn and worked in consulting.

Presentations

Breaking up the block: Using heterogenous population modeling to drive growth Session

All successful startups thrive on tight product-market fit, which can produce homogenous initial user bases. To become the next big thing, your user base will need to diversify, and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and changed user modeling to drive growth.

Harrison Lynch is the senior director of product management for the Consensus Marketplace Cloud and Risk Cloud products. Ever the Digital Sheriff, even when deep in the midst of a replatforming and developing an API product offering, he keeps himself awake at night pondering how to prevent identity thieves from stealing iPhones and how to manage multiple product managers and dev teams. Harrison has spent the past 15 years of his career in the niche world of integrating cell phone carrier systems with web applications while trying to stop fraudsters. He loves a good nerd fight and has never seen a product management blog that he won’t dive into.

Harrison is a 1995 graduate of the University of Oregon.

Presentations

Detecting retail fraud with data wrangling and machine learning Session

Matt Derda and Harrison Lynch explain how Consensus leverages the combined power of data wrangling and machine learning to more efficiently identify and reduce retail fraud and how adopting data wrangling technology has helped Trifacta reduce time spent data wrangling from six weeks to one week.

Kevin Lyons is senior vice president of data science for digital technology at Nielsen, where he is responsible for leading the vision and execution of Nielsen Marketing Cloud’s analytics and data optimization activities. Previously, Kevin was vice president of analytics and business intelligence at x+1, a leader in audience targeting that leverages sophisticated statistical modeling to surpass traditional online marketing techniques, where he strove to maximize profitable website user behavior via analytics and real-time decisioning; spent over a decade as a vice president responsible for web and marketing analytics at QualityHealth.com, a leading website providing consumer health news and information, and at Harte-Hanks, a large marketing service provider; and served in account management at Grey Direct. Kevin holds a BA in Russian language and Eastern European studies from the University of Illinois at Urbana-Champaign, an MA in medieval history from the Ohio State University, and an MA in applied statistics from Hunter College.

Presentations

Marketing at future speed Media and Ad Tech

Consumer behavior is in a constant state of flux. Adapting to these changes is especially hard, given the staggering amount of big data marketers need to understand and act on. Kevin Lyons offers an overview of Online Learning, a cutting-edge AI technology that uses event-level data streams to build and adapt models in real time.

Angie Ma is cofounder and COO of ASI Data Science, a London-based AI tech startup that offers data science as a service, which has completed more than 120 commercial data science projects in multiple industries and sectors and is regarded as the EMEA-based leader in data science. Angie is passionate about real-world applications of machine learning that generate business value for companies and organizations and has experience delivering complex projects from prototyping to implementation. A physicist by training, Angie was previously a researcher in nanotechnology working on developing optical detection for medical diagnostics.

Presentations

Data science for managers 2-Day Training

Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Data science for managers (Day 2) Training Day 2

Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Sean Ma is the director of product management at Trifacta. With more than 10 years of experience in enterprise data management software, Sean has spent the last five years building big data products at companies such as Informatica and Trifacta. He holds a bachelor of science degree in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Semi-automated analytic pipeline creation and validation using active learning Session

Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.

Madhav Madaboosi is a digital business and technology strategist within the Strategy, Architecture, and Planning Group at BP, where he leads a number of global innovation initiatives in the areas of robotic process automation, AI, big data, data lakes, and the industrial IoT. Previously, Madhav was the interface to several business portfolios within BP as a business information manager. Prior to BP, he worked in management consulting for a number of Fortune 100 firms. Madhav holds a degree in business and has completed executive programs at the Kellogg Institute of Management.

Presentations

Meta your data; drain the big data swamp Data Case Studies

Madhav Madaboosi and Meenakshisundaram Thandavarayan offer an overview of BP's self-service operational data lake, which improved operational efficiency, boosting productivity through fully identifiable data and reducing risk of a data swamp. They cover the path and big data technologies that BP chose, lessons learned, and pitfalls encountered along the way.

Mark Madsen is the global head of architecture at Think Big Analytics, where he is responsible for understanding, forecasting, and defining the analytics landscape and architecture. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning and vendors on product management. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide.

Presentations

Executive Briefing: BI on big data Session

There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian outline the trade-offs between a number of architectures that provide self-service access to data and discuss the pros and cons of architectures, deployment strategies, and examples of BI on big data.

Arup Malakar is a software engineer at Lyft.

Presentations

Dogfooding data at Lyft Session

Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering at Blizzard’s Global Insight Department; a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams that can leverage them effectively and achieve ROI with your data initiatives.

Time series data: Architecture and use cases Tutorial

If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data.

Jules Malin is a manager of product analytics and data science at GoPro, where he leads a team responsible for discovering product and behavioral insights from GoPro’s growing family and ecosystem of smart devices and driving product and user experience improvements, including influencing and refining data pipelines in Hadoop/Spark and developing scalable machine learning data products, metrics, and visualizations that produce actionable insights. Previously, Jules worked in product management and analytics engineering at Intel and Shutterfly. He holds a master’s degree in predictive analytics from Northwestern University.

Presentations

Drone data analytics using Spark, Python, and Plotly Data Case Studies

Drones and smart devices are generating billions of event logs for companies, presenting the opportunity to discover insights that inform product, engineering, and marketing team decisions. Jules Malin explains how technologies like Spark and analytics and visualization tools like Python and Plotly enable those insights to be discovered in the data.

Katie Malone is director of data science at data science software and services company Civis Analytics, where she leads a team of diverse data scientists who serve as technical and methodological advisors to the Civis consulting team and write the core machine learning and data science software that underpins the Civis Data Science Platform. Previously, she worked at CERN on Higgs boson searches and was the instructor of Udacity’s Introduction to Machine Learning course. Katie hosts Linear Digressions, a weekly podcast on data science and machine learning. She holds a PhD in physics from Stanford.

Presentations

Building a data science idea factory: How to prioritize the portfolio of a large, diverse, and opinionated data science team Session

A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup.

From the presidential campaign trail to the enterprise: Building effective data-driven teams Data Case Studies

The 2012 Obama campaign ran the first personalized presidential campaign in history. The data team was made up of people from diverse backgrounds who embraced data science in service of the goal. Civis Analytics emerged from this team and today enables organizations to use the same methods outside politics. Katie Malone shares lessons learned from these experiences for building effective teams.

Veronica Mapes is a technical program manager focused on human evaluation and computation at Pinterest, where she manages Pinterest’s internal human evaluation platform, maturing it from just an idea to a self-service platform with a 10 million annual run rate of tasks less than six months after launch, as well as third-party communities of crowdsourcing raters. She also hires, trains, and manages high-quality content evaluators and tests template and worker quality to ensure the delivery of highly accurate data for time series measurement and training machine learning models.

Presentations

Humans versus the machines: Using human-based computation to improve machine learning Session

Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform.

Taylor Martin is principal learning scientist at O’Reilly Media, where she helps a team of data scientists and engineers mix in just the right amount of data-driven learning engineering to personalize the learning experience across various forms of published media. Taylor’s research focuses on understanding how people learn, and she’s particularly interested in how adaptive and personalized learning can best be used to help people reach their learning goals faster. As an established academic and thought leader in the learning sciences, Taylor has spearheaded data-centric approaches to developing learning environments and measuring how people learn science, math, engineering, and computer science in environments that include online games, online programming environments (e.g., scratch.mit.edu), internship programs, Maker spaces, and engineering design labs.

Presentations

Data Case Studies tutorial welcome Tutorial

Taylor Martin welcomes you to Data Case Studies.

Hilary Mason is vice president of research at Cloudera Fast Forward Labs and data scientist in residence at Accel Partners. Previously, Hilary was chief scientist at Bitly. She cohosts DataGotham, a conference for New York’s homegrown data community, and cofounded HackNY, a nonprofit that helps engineering students find opportunities in New York’s creative technical economy. She’s on the board of the Anita Borg Institute and an advisor to several companies, including SparkFun Electronics, Wildcard, and Wonder. Hilary served on Mayor Bloomberg’s Technology Advisory Board and is a member of Brooklyn hacker collective NYC Resistor.

Presentations

Machine learning: What’s real and what’s hype Keynote

The power of machine learning is very real, but so too is the hype and confusion about when, where, and how to apply it. Hilary Mason explores practical business applications for intelligent machines and details the tools and processes required to implement machine learning successfully.

Andrew Mattarella-Micke is a senior data scientist at Intuit, specializing in deep learning for NLP applications. Previously, Andrew was a postdoctoral fellow at Vanderbilt and Stanford, where he studied the brain networks underlying mathematical development. He holds a PhD in cognitive neuroscience from the University of Chicago.

Presentations

Want to build a better chatbot? Start with your data. Session

When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Andrew Mattarella-Micke shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices he's learned along the way.

Terry McFadden is the principal enterprise information architect at Procter & Gamble, where he drives the company’s big data efforts. Terry has a long history of attacking wicked problems in the data area. He holds several US patents and was an early text analytics practitioner. Terry holds an MBA from Xavier University.

Presentations

BI and big data convergence in modern cloud architecture (sponsored by Arcadia Data) Session

Procter & Gamble relies heavily on data, particularly for BI. Running compute where the data lives is critical for performance, and the company has found added benefits to this architecture, which complements its Hadoop and BI needs. Terry McFadden offers an overview of P&G's modern analytics architecture and explains how it differs from traditional approaches.

Brian McMahan is a research engineer at Joostware, a San Francisco-based company that specializes in consulting and building intellectual property in natural language processing and deep learning. He is also a cofounder at R7 Speech Sciences, a company focused on understanding spoken conversations. Brian is wrapping up his PhD in computer science at Rutgers University, where his research focuses on Bayesian and deep learning models for grounding perceptual language in the visual domain. Brian has also conducted research in reinforcement learning and various aspects of dialogue systems.

Presentations

Machine learning with PyTorch 2-Day Training

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Machine learning with PyTorch (Day 2) Training Day 2

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Guru Medasani is a data science architect at Domino Data Lab, where he helps small and large enterprises build efficient machine learning pipelines. Previously, he was a senior solutions architect at Cloudera, where he helped customers build big data platforms and leverage technologies like Apache Hadoop and Apache Spark to solve complex business problems. Some of the business applications he’s worked on include applications for collecting, storing, and processing huge amounts of machine and sensor data, image processing applications on Hadoop, machine learning models to predict consumer demand, and tools to perform advanced analytics on large volumes of data stored in Hadoop.

Presentations

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark Session

When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.

Dong Meng is a data scientist at MapR, where he helps customers solve their business problems with big data by turning their data into actionable insights or machine learning products. His recent work includes integrating open source machine learning frameworks like PredictionIO and XGBoost with MapR’s platform. He also created time series QSS and deep learning QSS as a MapR service offering. Dong has several years of experience in statistical machine learning, data mining, and big data product development. Previously, he was a senior data scientist with ADP, where he built machine learning pipelines and data products for HR using payroll data to power ADP Analytics, and a staff software engineer with IBM SPSS, where he was part of the team that built Watson Analytics. During his graduate study at the Ohio State University, Dong served as a research assistant, concentrating on compressive sensing and solving point estimation problems from a Bayesian perspective.

Presentations

Distributed deep learning with containers on heterogeneous GPU clusters Session

Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters.

Peng Meng is a senior software engineer on the big data and cloud team at Intel, where he focuses on Spark and MLlib optimization. Peng is interested in machine learning algorithm optimization and large-scale data processing. He holds a PhD from the University of Science and Technology of China.

Presentations

Spark ML optimization at Intel: A case study Session

Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel's work on Spark ML optimization.

Matteo Merli is a software engineer at Streamlio working on messaging and storage technologies. Previously, he spent several years at Yahoo building database replication systems and multitenant messaging platforms. Matteo was the architect and lead developer for Yahoo Pulsar and a member of the PMC of Apache BookKeeper.

Presentations

Effectively once in Apache Pulsar, the next-generation messaging system Session

Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty.

Gian Merlino is CTO and cofounder of Imply and is one of the original committers of the Druid project. Previously, he worked at Metamarkets and Yahoo. Gian holds a BS in computer science from the California Institute of Technology.

Presentations

NoSQL no more: SQL on Druid with Apache Calcite Session

Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.

John Mertic is director of program management for ODPi and the Open Mainframe Project at the Linux Foundation. John comes from a PHP and open source background. Previously, he was director of business development software alliances at Bitnami, a developer, evangelist, and partnership leader at SugarCRM, board member at OW2, president of OpenSocial, and a frequent speaker at conferences around the world. As an avid writer, John has published articles on IBM developerWorks, Apple Developer Connection, and PHP Architect and authored The Definitive Guide to SugarCRM: Better Business Applications and Building on SugarCRM.

Presentations

The rise of big data governance: Insight on this emerging trend from active open source initiatives Session

John Mertic and Maryna Strelchuk detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share how companies like ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.

Thomas W. Miller is faculty director of the Data Science Program at Northwestern University, where he has developed and taught a number of courses, including practical machine learning, web information retrieval, and network data science. In addition, he consults with businesses about performance and value measurement, data science methods, information technology, and best practices for building teams of data scientists and data engineers. Thomas has written six books about data science.

Presentations

Working with the data of sports Data Case Studies

Sports analytics today is more than a matter of analyzing box scores and play-by-play statistics. With detailed on-field or on-court data from every game, sports teams face challenges in data management, data engineering, and analytics. Thomas Miller details the challenges faced by a Major League Baseball team as it sought competitive advantage through data science and deep learning.

Nina Mishra is principal scientist at Amazon Web Services, where she focuses on data science, data mining, web search, machine learning, and privacy. Nina has many years of experience leading projects at Amazon, Microsoft Research, and HP Labs. She was also an associate professor at the University of Virginia and an acting faculty member at Stanford University. Nina’s research encompasses the design and evaluation of new data mining algorithms on real, colossal-sized datasets. She has authored almost 50 publications in top venues, including WWW, WSDM, SIGIR, ICML, NIPS, AAAI, COLT, VLDB, PODS, CRYPTO, EUROCRYPT, FOCS, and SODA, which have been recognized with best paper award nominations. Nina’s research was central to the Bing search engine and has been widely featured in external press coverage. Nina holds 14 patents with a dozen more still in the application stage. She has had the distinct privilege of helping others advance in their careers, including 15 summer interns and many full-time researchers. Nina’s service to the community includes serving on the editorial boards of Machine Learning, the Journal of Privacy and Confidentiality, IEEE Transactions on Knowledge and Data Engineering, and IEEE Intelligent Systems and chairing the premier machine learning conference ICML in 2003, as well as serving on numerous program committees for web search, data mining, and machine learning conferences. She was awarded an NSF grant as a principal investigator and has served on eight PhD dissertation committees.

Presentations

Continuous machine learning over streaming data Session

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.

Rajat Monga leads TensorFlow, an open source machine learning library and the center of Google’s efforts at scaling up deep learning. He is one of the founding members of the Google Brain team and is interested in pushing machine learning research forward toward general AI. Previously, Rajat was the chief architect and director of engineering at Attributor, where he led the labs and operations and built out the engineering team. A veteran developer, Rajat has worked at eBay, Infosys, and a number of startups.

Presentations

The current state of TensorFlow and where it's headed in 2018 Session

Rajat Monga offers an overview of TensorFlow's progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas.

Ajay Mothukuri is an architect on the data technologies team at Sapient.

Presentations

Achieving GDPR compliance and data privacy using blockchain technology Session

Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).

Manu Mukerji is senior director of data, machine learning, and analytics at 8×8. Manu’s background lies in cloud computing and big data, working on systems handling billions of transactions per day in real time. He enjoys building and architecting scalable, highly available data solutions and has extensive experience working in online advertising and social media.

Presentations

Machine learning versus machine learning in production Session

Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; production issues that arise when applying ML at scale in production; lessons learned; and more.

Rodney Mullen is widely considered the most influential skateboarder in the history of skateboarding. Despite Alan Gelfand’s justifiable fame for inventing the ollie air (primarily a vert or pool-oriented trick), Rodney is responsible for the invention and development of the street ollie. The ability to pop the board off of the ground and land back on the board while moving has quite likely been the most significant development in modern skateboarding. This invention alone would rank Mullen the most important skateboarder of all time. The majority of ollie and flip tricks he invented throughout the 1980s, including the flatground ollie, the kickflip, the heelflip, and the 360 flip, are now fundamental aspects of modern vertical and street skateboarding. Rodney’s career highlights include winning nearly 30 contests as a child, skating for the Powell-Peralta Bones Brigade, and founding World Industries and the Almost skateboarding company. He has been featured in numerous videos, including Bones Brigade videos, the 1988 film Gleaming the Cube, alongside actor Christian Slater, World Industries’ Rubbish Heap, Plan B’s Questionable, Virtual Reality, and Second Hand Smoke, the Rodney Mullen vs. Daewon Song series, Globe Opinion, and Almost: Round Three. In 2002, Rodney won the Transworld Reader’s Choice Award for Skater of the Year. He is the author of The Mutt: How to Skateboard and Not Kill Yourself.

Presentations

Small pieces, loosely joined: A skater's code Session

The essence of modern skating is learning tricks that couple with specific terrain. Activision’s video game franchise testifies to the nearly endless possibilities. Rodney Mullen offers a nuanced look at how skaters nudge the endpoints of disparate submovements to create new combinations that may shine a different light on ideas in machine learning—plus it’s a lot of fun.

Ash Munshi is CEO of Pepperdata. Previously, Ash was executive chairman for deep learning startup Marianas Labs (acquired by Askin in 2015); CEO of big data storage startup Graphite Systems (acquired by EMC DSSD in 2015); CTO of Yahoo; and CEO of a number of other public and private companies. He serves on the board of several technology startups.

Presentations

Classifying job execution using deep learning Session

Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series.

Aaron T. Myers is a software engineer at Cloudera and an Apache Hadoop committer. Aaron’s work is primarily focused on HDFS. Prior to joining Cloudera, Aaron was a software engineer and VP of engineering at Amie Street, where he worked on all components of the software stack, including operations, infrastructure, and customer-facing feature development. Aaron holds both an Sc.B. and Sc.M. in computer science from Brown University.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Data reflections: Making data fast and easy to use without making copies Session

Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies.

Balasubramanian Narasimhan is a senior research scientist in the Department of Statistics and the Department of Biomedical Data Sciences at Stanford University and the director of the Data Coordinating Center within the Department of Biomedical Data Sciences. His research areas include statistical computing, distributed computing, clinical trial design, and reproducible research. Balasubramanian coteaches a computing for data science course with John Chambers, an inventor of the S language.

Presentations

Distributed clinical models: Inference without sharing patient data Session

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly Media and director of community evangelism at Databricks and Apache Spark. Paco is the cochair of JupyterCon and an advisor for Amplify Partners, Deep Learning Analytics, and Recognai. He was named one of the top 30 people in big data and analytics in 2015 by Innovation Enterprise.

Presentations

Human in the loop: A design pattern for managing teams working with machine learning Session

Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.

Ann Nguyen evangelizes design for impact at Whole Whale, where she leads the tech and design team in building meaningful digital products for nonprofits. She has designed and managed the execution of multiple websites for organizations including the LAMP, Opportunities for a Better Tomorrow, and Breakthrough. Ann is always challenging designs with A/B testing. She bets $1 on every experiment that she runs and to date has accumulated a decent sum. Previously, Ann worked with a wide range of organizations from the Ford Foundation to Bitly. She is Google Analytics and Optimizely Platform certified. Ann is a regular speaker on nonprofit design and strategy and recently presented at the DMA Nonprofit Conference. She has also taught at Sarah Lawrence College. Outside of work, Ann enjoys multisensory art, comedy shows, fitness, and making cocktails, ideally all together.

Presentations

Using ML to improve UX and literacy for young poets Data Case Studies

Power Poetry is the largest online platform for young poets, with over 350K users. Ann Nguyen explains how Power Poetry is extending the learning potential with machine learning and covers the technical elements of the Poetry Genome, a series of ML tools to analyze and break down similarity scores of the poems added to the site.

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Continuous machine learning over streaming data Session

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.

Dinesh Nirmal’s development mission is to build technology that lets businesses operationalize new technologies like machine learning and blockchain and achieve immediate value. As vice president of development for IBM Analytics, Dinesh leads 10 global development labs and design studios. Major releases during his tenure include IBM Data Science Experience (winner of the iF Product Design Award and the Red Dot Design Award), Machine Learning for z/OS, IBM Integrated Analytics System, and IBM Cloud Private for Data (a platform to accelerate analytics and operationalize AI). He also launched six machine learning hubs to work face-to-face with clients. Dinesh speaks and writes internationally on the principles required for successful enterprise machine learning and serves on the board of the R Consortium.

Presentations

Operationalizing machine learning (sponsored by IBM) Keynote

Machine learning research and incubation projects are everywhere, but less common, and far more valuable, is the innovation unlocked once you bring machine learning out of research and into production. Dinesh Nirmal explains how real-world machine learning reveals assumptions embedded in business processes and in the models themselves that cause expensive and time-consuming misunderstandings.

Berk Norman is a data scientist in the Department of Radiology and Biomedical Imaging at UC San Francisco, where he works on constructing deep learning models.

Presentations

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.

A leading expert on big data architectures, Stephen O’Sullivan has 25 years of experience creating scalable, high-availability data and applications solutions. A veteran of Silicon Valley Data Science, @WalmartLabs, Sun, and Yahoo, Stephen is an independent adviser to enterprises on all things data.

Presentations

Enough data engineering for a data scientist; or, How I learned to stop worrying and love the data scientists Session

Stephen O'Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it Session

Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands), and outlines proven ways to meet them.

Andrew Parker is a General Partner at Spark Capital, focusing on early-stage investments.

Andrew has led Spark’s investments in Carta, Kik, Panorama Education, Socratic, Splash, Timehop, Particle, and Quantopian.

Andrew surveys the startup landscape through the lens of a recovering product designer and engineer. He is consistently drawn to entrepreneurs who are makers—the ones who, in the early days, are actually hands on building the product experience themselves.

Prior to joining Spark in 2010, Andrew was a member of the investment team at Union Square Ventures. Before becoming an investor, Andrew did UI design and user-experience testing at Homestead Technologies and was a web developer at Groupspace.org.

Andrew holds a B.S. in Symbolic Systems from Stanford University. When not working, you’ll find Andrew outside: running, hiking, skating, or doing yoga. Andrew lives in Palo Alto with his wife (a cancer surgeon) and their cat.

Presentations

Make data work: A VC panel discussion on prospectives and trends Session

To anticipate who will succeed and to invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, top-tier VCs look over the horizon to consider the big trends in how data is being put to work in startups and share what they think the field will look like in a few years (or more).

Andrea Pasqua is a data science manager at Uber, where he leads the time series forecasting and anomaly detection teams. Previously, Andrea was director of data science at Radius Intelligence, a company spearheading the use of machine learning in the marketing space; a financial analyst at MSCI, a leading company in the field of risk analysis; and a postdoctoral fellow in biophysics at UC Berkeley. He holds a PhD in physics from UC Berkeley.

Presentations

Detecting time series anomalies at Uber scale with recurrent neural networks Session

Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.

Mo Patel is an independent deep learning consultant advising individuals, startups, and enterprise clients on strategic and technical AI topics. Mo has successfully managed and executed data science projects with clients across several industries, including cable, auto manufacturing, medical device manufacturing, technology, and car insurance. Previously, he was practice director for AI and deep learning at Think Big Analytics, a Teradata Company, where he mentored and advised Think Big clients and provided guidance on ongoing deep learning projects; he was also a management consultant and a software engineer earlier in his career. A continuous learner, Mo conducts research on applications of deep learning, reinforcement learning, and graph analytics toward solving existing and novel business problems and brings a diversity of educational and hands-on expertise connecting business and technology. He holds an MBA, a master’s degree in computer science, and a bachelor’s degree in mathematics.

Presentations

Learning PyTorch by building a recommender system Tutorial

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.

Neejole Patel is a sophomore at Virginia Tech, where she is pursuing a BS in computer science with a focus on machine learning, data science, and artificial intelligence. In her free time, Neejole completes independent big data projects, including one that tests the Broken Windows theory using DC crime data. She recently completed an internship at a major home improvement retailer.

Presentations

Learning PyTorch by building a recommender system Tutorial

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.

Rizwan Patel is senior director of big data, innovation, and emerging technology at Caesars Entertainment. A senior technologist with strong leadership skills coupled with hands-on application and system expertise, Rizwan has a proven track record of delivering large-scale, mission-critical projects on time and budget using leading-edge technologies to solve critical business problems as well as extensive experience in managing client relations at all levels, including senior executives.

Presentations

Big data applicability to the gaming industry Media and Ad Tech

Rizwan Patel explains how the gaming industry can leverage Cloudera’s big data platform to adapt to the change in patron dynamics (both in terms of demographics and spending patterns) to create a new paradigm for customer (micro)segmentation.

Vanja Paunic is a data scientist in the Algorithms and Data Science Group at Microsoft London. She works on building machine learning solutions with external companies utilizing Microsoft’s AI Cloud Platform. She holds a PhD in computer science with a focus on data mining in the biomedical domain from the University of Minnesota.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Valentina Pedoia is a specialist in the Musculoskeletal and Imaging Research Group at UCSF and a data scientist focusing on developing algorithms for advanced computer vision and machine learning for improving the usage of noninvasive imaging as diagnostic and prognostic tools. Her current research explores the role of machine learning in the extraction of contributors to osteoarthritis (OA). She is studying analytics to model the complex interactions between morphological, biochemical, and biomechanical aspects of the knee joint as a whole, as well as deep learning convolutional neural networks for musculoskeletal tissue segmentation and for the extraction of silent features from quantitative relaxation maps for a comprehensive study of the biochemical articular cartilage composition. The ultimate goal is to develop a completely data-driven model that is able to extract imaging features and use them to identify risk factors and predict outcomes. Previously, she was a postdoc in the Musculoskeletal and Imaging Research Group, where she provided support and expertise in medical computer vision with a focus on reducing human effort and extracting semantic features from MRIs to study degenerative joint disease. Valentina’s recent work on machine learning applied to OA was named an annual scientific highlight of the 25th conference of the International Society of Magnetic Resonance In Medicine (ISMRM 2017) and selected as best paper presented at the MRI drug discovery study group. Valentina holds a PhD in computer science, where her research focused on feature extraction from functional and structural brain MRI in subjects with glial tumors.

Presentations

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.

Andreas Pfadler is a machine learning engineer at Talking Data. He holds a PhD in mathematics and previously worked as a consultant in the financial industry. He is passionate about math, machine learning, software architecture, and cooking. He currently lives in Beijing.

Presentations

On-device deep learning: Trends, technologies, and challenges (sponsored by TalkingData) Session

Andreas Pfadler offers an overview of current technological trends for on-device deep learning and edge computing. Along the way, Andreas explores major players and platforms and computational challenges and solutions. Andreas concludes with a discussion of TalkingData's vision for the future of mobile deep learning.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, Tom was an early employee at VMware; as senior staff engineer, he was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular pluggable storage architecture. He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit filesystem.

Presentations

How to protect big data in a containerized environment Session

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them.

Patrick Phelps is the lead data scientist on ads at Pinterest, focusing on auction dynamics and advertiser success. Previously, Patrick was the lead data scientist at Yelp, leading a team focusing on projects as diverse as search, ads, delivery operations, and HR. He has an engineering background in traffic quality (the art of distinguishing automated systems and malicious actors from legitimate users across a variety of platforms) and held an Insight Data Science fellowship. Patrick is passionate about the ability of data to provide key, quantitative insights to businesses during the decision-making process and is an advocate for data science education across all layers of a company. Patrick holds a PhD in experimental high-energy particle astrophysics.

Presentations

Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science Session

Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.

Marcin Pilarczyk is a data scientist and a leader of Ryanair’s Data and Analytics Department. Marcin has around 14 years of professional experience working in the aviation, telco, and financial industries in topics including data science, big data solutions, and data warehouses.

Presentations

Data-driven fuel management at Ryanair Session

Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management of a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions.

Sangeeth is a strategic and results-oriented technology leader with 20-plus years of broad leadership experience in technology strategy and enterprise solutions delivery in the retail, hospitality, and financial services domains. He has extensive experience in overseeing strategic programs and managed portfolios encompassing e-Commerce, CRM/Loyalty, Supply Chain, Distribution, Business Intelligence, Data Analytics, and Operations across a variety of technology platforms. He has a passion for building high-performing teams and motivating individuals to maximize organizational productivity.

Presentations

Automating decisions with data in the cloud Keynote

Amr Awadallah explains why the cloud requires a different approach to machine learning and analytics and what you can do about it.

Jennifer Prendki is the vice president of machine learning at Figure Eight, the essential human-in-the-loop AI platform for data science and machine learning teams. She has spent most of her career creating a data-driven culture wherever she went, succeeding in sometimes highly skeptical environments. She is particularly skilled at building and scaling high-performance machine learning teams and is known for enjoying a good challenge. Trained as a particle physicist (she holds a PhD in particle physics from Sorbonne University), she likes to use her analytical mind not only when building complex models but also as part of her leadership philosophy. She is pragmatic yet detail oriented. Jennifer also takes great pleasure in addressing both technical and nontechnical audiences alike at conferences and seminars and is passionate about attracting more women to careers in STEM.

Presentations

The science of patchy data Session

Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation.

Michael Prorock is founder and CTO at mesur.io. Michael is an expert in systems and analytics, as well as in building teams that deliver results. Previously, he was director of emerging technologies for the Bardess Group, where he defined and implemented a technology strategy that enabled Bardess to scale its business to new verticals across a variety of clients, and worked in analytics for Raytheon, Cisco, and IBM, among others. He has filed multiple patents related to heuristics, media analysis, and speech recognition. In his spare time, Michael applies his findings and environmentally conscious methods on his small farm.

Presentations

Smart agriculture: Blending IoT sensor data with visual analytics Data Case Studies

Mike Prorock offers an overview of mesur.io, a game-changing climate awareness solution that combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market. Mesur.io enables growers to monitor areas of concern, providing immediate benefits to crop yield, supply costs, farm labor overhead, and water consumption.

Jiangjie Qin is a software engineer on the data infrastructure team at LinkedIn, where he works on Apache Kafka. Previously, Jiangjie worked at IBM, where he managed IBM’s zSeries platform for banking clients. He is a Kafka PMC member. Jiangjie holds a master’s degree in information networking from Carnegie Mellon’s Information Networking Institute.

Presentations

The secret sauce behind LinkedIn's self-managing Kafka clusters Session

LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.

Paul Raff is a principal data scientist manager on Microsoft’s analysis and experimentation team, where he and his team work to enable scalable experimentation for teams around Microsoft, including Windows 10, Office Online, Exchange Online, and Cortana, focusing on experiment quality and ensuring that all experiments are operating as intended and in a way that allows for the appropriate conclusions to be made. Previously, he was a supply chain researcher at Amazon. Paul holds a PhD in mathematics from Rutgers University as well as degrees in mathematics and computer science from Carnegie Mellon University.

Presentations

A/B testing at scale: Accelerating software innovation Tutorial

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.

Syed Rafice is a principal system engineer at Cloudera specializing in big data on Hadoop technologies and both platform and cybersecurity. He is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed has worked across multiple sectors including government, telecoms, media, utilities, financial services, and transport.

Presentations

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments Tutorial

New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR.

Greg is responsible for driving SQL product strategy as part of Cloudera’s data warehouse product team, including working directly with Impala. Over 20 years, Greg has worked with relational database systems across a variety of roles – including software engineering, database administration, database performance engineering, and most recently product management – providing a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Analytics in the cloud: Building a modern cloud-based big data warehouse Session

For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud.

Divya Ramachandran is the VP of Product at Captricity Inc., where she oversees the creation of effective and impactful product experiences for Captricity users. Her background is in human-centered design, which she practiced as a UX designer at a handful of startups for several years. She is most excited when she is dissecting complex problems to find and solve human pain points. Divya holds a Ph.D. in Human Computer Interaction from the University of California at Berkeley, where her research received awards from two prestigious academic conferences. Divya’s areas of interest lie at the intersection of data, design, international development, and social good. In her free time, Divya loves to cook, experiment with music, and explore sunny East Bay parks with her two young kids.

Presentations

Automation and analytics enablement in life insurance Data Case Studies

Divya Ramachandran explains how top insurance companies have used handwriting transcription powered by deep learning to achieve a more than 70% reduction in daily operational processing time, develop a best-in-industry predictive model for assessing mortality risk from decades of archived forms, and enable a smarter claims leakage review, which led to a 10x ROI in its first year.

Mala Ramakrishnan heads product initiatives for Cloudera Altus, a big data platform-as-a-service. She has 17+ years of experience in product management, marketing, and software development in organizations of varied sizes that deliver middleware, software security, network optimization, and mobile computing. She holds a master’s degree in computer science from Stanford University.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Effectively once, exactly once, and more in Heron Session

Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and demonstrate how your applications will benefit from using them.

Modern real-time streaming architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Karthik Ramasamy leads a data science team at Uber focusing on solving fraud problems using machine learning. His team builds advanced machine learning models like semisupervised and deep learning models to detect account takeovers and stolen credit cards. Previously, Karthik was a cofounder of LogBase, where he worked on real-time analytics infrastructure and built models to rate drivers based on their driving behavior, and a founding member of the LinkedIn security team, where he developed various security products, with a particular focus on anti-automation efforts.

Presentations

Using computer vision to combat stolen credit card fraud Session

Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.

Radhika Rangarajan is an engineering director for big data technologies within Intel’s Software and Services Group, where she manages several open source projects and partner engagements, specifically on Apache Spark and machine learning. Radhika is one of the cofounders and the director of the West Coast chapter of Women in Big Data, a grassroots community focused on strengthening the diversity in big data and analytics. Radhika holds both a bachelor’s and a master’s degree in computer science and engineering.

Presentations

Accelerating analytics and AI from the edge to the cloud (sponsored by Intel) Session

Advanced analytics and AI workloads require a scalable and optimized architecture, from hardware and storage to software and applications. Kevin Huiskes and Radhika Rangarajan share best practices for accelerating analytics and AI and explain how businesses globally are leveraging Intel’s technology portfolio, along with optimized frameworks and libraries, to build AI workloads at scale.

Delip Rao is the founder of R7 Speech Science, a San Francisco-based company focused on building innovative products on spoken conversations. Previously, Delip was the founder of Joostware, which specialized in consulting and building IP in natural language processing and deep learning. Delip is a well-cited researcher in natural language processing and machine learning and has worked at Google Research, Twitter, and Amazon (Echo) on various NLP problems. He is interested in building cost-effective, state-of-the-art AI solutions that scale well. Delip has an upcoming book on NLP and deep learning from O’Reilly.

Presentations

Machine learning with PyTorch 2-Day Training

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Machine learning with PyTorch (Day 2) Training Day 2

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Santosh Rao is a senior technical director in ONTAP engineering at NetApp, where he is responsible for NetApp’s vision and technology direction for data engineering. He works across a variety of data ecosystem partners and a broad range of customers to bring NetApp’s products to market in the data workload areas. Santosh has held a variety of roles at NetApp, including founding architect of clustered ONTAP for block enterprise apps, and has delivered a variety of virtualization, mobility, and protection products for enterprise app workloads. Previously, he was a master technologist at HP and delivered a number of first-generation products to market in the enterprise storage and systems space.

Presentations

Architecting an edge-to-cloud data pipeline to unify multiple data sources and processing engines (sponsored by NetApp) Session

Santosh Rao explores the architecture of a data pipeline from edge to core to cloud and across various data sources and processing engines and explains how to build a solution architecture that enables businesses to maximize the competitive differentiation with the ability to unify data insights in compelling yet efficient ways.

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed, robust cloud applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. Radhika enjoys spending time with her family, walking her dog, doing Warrior X-Fit, and playing an occasional hand at Smash Bros.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Andrew Ray is a senior technical expert at Sam’s Club Technology. He is passionate about big data and has extensive experience working with Apache Spark and Hadoop. Previously, at Walmart, Andrew built an analytics platform on Hadoop that integrated data from multiple retail channels using fuzzy matching and distributed graph algorithms and led the adoption of Spark from proof of concept to production. He is an active contributor to the Apache Spark project, including SparkSQL and GraphX. Andrew holds a PhD in mathematics from the University of Nebraska, where he worked on extremal graph theory.

Presentations

Writing distributed graph algorithms Session

Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions.

Brandon Reeves is an investor at Lux Capital. His goal is to identify companies that are turning science fiction into science fact, especially in areas of autonomous systems, robotics, machine intelligence, and space. Prior to joining Lux, he was on the investment team at Capricorn Investment Group in Palo Alto, California, where he focused on autonomy, mobility, space, and semiconductors. Previously, he worked for Texas Instruments in Sunnyvale, California, where he was a Field Applications Engineer who supported Tesla Motors and their autopilot and sensing teams. Brandon holds a BS in electrical engineering from the University of Central Florida and an MBA from Harvard Business School.

Presentations

Make data work: A VC panel discussion on prospectives and trends Session

To anticipate who will succeed and to invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, top-tier VCs look over the horizon to consider the big trends in how data is being put to work in startups and share what they think the field will look like in a few years (or more).

Joseph (Joey) Richards is vice president of data and analytics at GE Digital and head of the Wise.io data science applications team, which is responsible for defining and implementing machine learning applications on behalf of GE and its customers. Previously, he was cofounder and chief data scientist at Wise.io (acquired by GE in 2016), where he built and deployed high-value ML applications for dozens of customers; an NSF postdoctoral researcher in the Statistics and Astronomy Departments at UC Berkeley; and a Fulbright Scholar whose research focused on the applications of supervised and semisupervised learning for problems in astrophysics. Joey holds a PhD in statistics from Carnegie Mellon University.

Presentations

Machine learning applications for the industrial internet Session

Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas.

Randy Ridgley is a solutions architect on the Amazon Web Services Public Sector team. Previously, Randy worked for Walt Disney World in Orlando as the principal application architect on its MagicBand platform, improving guest experience and cast coordination by building big data solutions based on AWS services. He has over 15 years of experience in media and entertainment, casino gaming, and publishing, building real-time streaming and big data analytics applications.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Alexis Roos is director of data science and machine learning at Salesforce, where he leads a team of data engineers and scientists focusing on deriving intelligence from activity data for the Einstein platform. Alexis has over 20 years of software engineering experience, with the last six years focused on large-scale data science and engineering using technologies including data engineering, entity resolution, distributed graph processing, machine learning, natural language processing, and deep learning. He has worked for SIs in Europe, Sun Microsystems/Oracle, and several startups, including Radius Intelligence, Concurrent, and Couchbase. Alexis started learning programming as a teenager and was an avid 68000 programmer. He is a frequent speaker at meetups and conferences such as Spark Summit SF and East, Scala by the Bay, Hadoop Summit, O’Reilly Web 2.0, and JavaOne. He has also led trainings and two university-level courses on big data. Alexis is a mentor at thecamp. He holds a master’s degree in CS with a focus on cognitive sciences.

Presentations

Building a contacts graph from activity data Session

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph modeling contextual data.

Alex Rosenblat is a researcher and technical writer at the Data & Society Research Institute in NYC. A technology ethnographer trained in sociology, Alex studies how technology shapes changes in the workplace and why that transforms how we relate to one another in society. Her multidisciplinary scholarship spans Uber’s drivers, algorithmic management, information and power asymmetries on employment platforms, and surveillance and accountability. She is the author of Uberland: How Algorithms Are Rewriting The Rules Of Work, forthcoming from the University of California Press in 2018. Her most recent work is available in the International Journal of Communications, the Columbia Law Review, Policy & Internet, and Surveillance & Society. Alex’s research has been featured in the New York Times, the Wall Street Journal, MIT Technology Review, the New Scientist, the Guardian, Vice, Motherboard, Fast Company, CTV, and elsewhere, and she is an occasional contributor to Harvard Business Review, Fast Company, Motherboard, the Atlantic, and Pacific Standard. Alex holds a BA in history from McGill University and an MA in sociology from Queen’s University in Kingston, Canada.

Presentations

Workplace culture in the age of algorithmic management: The information networks Uber drivers built Session

Ride-hail drivers work alone, but they’re banding together online to compare notes, uncover new policies, and help each other navigate a workplace characterized by information scarcity. Alex Rosenblat explores how ride-hail workers are using online forums to create their own workplace culture as employment relationships grow more remote and algorithms replace human managers.

Steve Ross is the director of product management at Cloudera, where he focuses on security across the big data ecosystem, balancing the interests of citizens, data scientists, and IT teams working to get the most out of their data while preserving privacy and complying with the demands of information security and regulations. Previously, at RSA Security and Voltage Security, Steve managed product portfolios now in use by the largest global companies and hundreds of millions of users.

Presentations

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Meet the Expert with Mark Donsky and Steven Ross (Cloudera) Meet the Experts

Curious about GDPR? Stop by and chat with Mark and Steven about best practices for moving toward GDPR compliance, what other organizations are achieving, and how GDPR will impact your organization.

Philipp Rudiger is a software developer at Anaconda, where he develops open source and client-specific software solutions for data management, visualization, and analysis. Philipp holds a PhD in computational modeling of the visual system.

Presentations

Custom interactive visualizations and dashboards for one billion datapoints on a laptop in 30 lines of Python Tutorial

Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code.

Derek Ruths is cofounder and chief architect of CAI, a charity focused on bringing the power of data science to social good initiatives. Derek is also an associate professor of computer science at McGill University, the head of R&D at Data Sciences, and the director of the McGill Centre for Social and Cultural Data Science. In these capacities, he works closely with major tech companies, advises governments on technical innovation, teaches executive education programs, and partners with international humanitarian organizations. In his work and research, Derek has been a longtime advocate for the essential role of data science in fostering more equitable, more prosperous, and healthier organizations and societies.

Presentations

How to avoid pitfalls when reasoning with data Session

Unreasonable sales forecasts, badly overstocked inventory, misguided investments . . . bad analyses happen all the time, leading to bad decisions and costing businesses millions of dollars. Derek Ruths shares the five most common issues that lead to bad data-informed thinking.

Alexander Ryabov is head of data services for business intelligence at Wargaming. A leader in the definition, implementation, and operation of visionary architectures and solutions, Alexander has a passion for developing high-performing global matrix organizations.

Presentations

Winning the big data war pays big dividends for Wargaming (sponsored by SAS) Session

Alexander Ryabov and Jonathan Crow explain how Wargaming is winning the battle for bigger profits in the virtual world of online gaming using a best-in-class business intelligence solution to equip its business units with decision-making tools.

Stefan Salandy is a systems engineer at Cloudera.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Michael Schrenk has developed software that collects and processes information for some of the biggest news agencies in Europe and leads a competitive intelligence consultancy in Las Vegas, where he consults on information security everywhere from Moscow to Silicon Valley, and most places in between. Mike is the author of Webbots, Spiders, and Screen Scrapers. He has lectured at journalism conferences in Belgium and the Netherlands and has created several weekend data workshops for the Centre for Investigative Journalism in London. Along the way, he’s been interviewed by the BBC, the Christian Science Monitor, National Public Radio, and many others. Mike is also an eight-time speaker at the notorious DEF CON hacking conference. He may be best known for building software that, over a period of a few months, autonomously purchased over $13 million worth of cars by adapting to real-time market conditions.

Presentations

Understanding metadata Session

Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs, making it ideal for implementing neural networks and other machine learning algorithms. Robert Schroll demonstrates TensorFlow's capabilities and walks you through building machine learning models on real-world data.

Machine learning with TensorFlow (Day 2) Training Day 2

The instructors demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.

Baron Schwartz is the founder and CTO of VividCortex, the best way to see what your production database servers are doing. Baron has written a lot of open source software and several books, including High Performance MySQL. He’s focused his career on learning and teaching about performance and observability of systems generally, including the view that teams are systems and culture influences their performance, and databases specifically.

Presentations

Why nobody cares about your anomaly detection Session

Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. He is passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals, and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

The changing role of the CDO: Three keys for success (sponsored by MapR) Session

The value of data is not strictly a function of its size but rather is in the value that can be extracted from it. Jim Scott explains how to identify the right data to leverage to monitor the pulse of fast changing business environments, the best way to integrate analytics into your business processes, and the importance of cross-application data flows.

Skipper Seabold is director of data science R&D and a product lead at Civis Analytics in Chicago. He leads a team of data scientists from all walks of life, from physicists and biologists to statisticians and computer scientists. Together they drive the data science behind the products Civis offers and push the capabilities of the solutions Civis provides to its clients. He is an economist by training and has a decade of experience working in the Python data open source community. He started and led the statsmodels Python project, was formerly on the core pandas team, and has contributed to many projects in the Python data stack. He holds strong opinions about writing and barbecue.

Presentations

Building a data science idea factory: How to prioritize the portfolio of a large, diverse, and opinionated data science team Session

A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup.

Paul Sears is a solutions architect supporting AWS partners in the big data space.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

Nikita Shamgunov is CTO at MemSQL. Previously, Nikita was a senior database engineer for Microsoft’s SQL Server. He has been awarded several patents and was a world medalist in ACM programming contests. Nikita holds a BS, MS, and PhD in computer science.

Presentations

Building the foundation of a latency-free life (sponsored by MemSQL) Keynote

We live in a world that’s always connected. As a result, today’s intelligent applications need to react immediately to changing conditions. To achieve this, applications require a foundation that is latency free. Nikita Shamgunov shares a vision of latency-free life supported by modern data architectures.

Janelle Shane’s neural network blog, AIweirdness.com, features computer programs that try to invent human things like recipes and paint colors and Halloween costumes. AIweirdness.com has been covered in the Guardian, the Atlantic, NBC News, and Slate and was even featured as a recent quiz question on Wait Wait, Don’t Tell Me. Janelle also works as a research scientist in Colorado, where she makes computer-controlled holograms for studying the brain. She has only made a neural network recipe once and discovered that horseradish brownies are about as terrible as you might imagine.

Presentations

Sprouted clams and stanky bean: When machine learning makes mistakes Keynote

At AIweirdness.com Janelle Shane posts the results of neural network experiments gone delightfully wrong. But machine learning mistakes can also be very embarrassing or even dangerous. Using silly datasets as examples, Janelle talks about some ways that algorithms fail.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Meet the Expert with Gwen Shapira (Confluent) Meet the Experts

Gwen's Kafka expertise could be invaluable in answering your questions about Apache Kafka internals, when Apache Kafka is not a good fit for a use case, Apache Kafka and cloud native architectures, or Apache Kafka and IoT/edge architectures.

The future of ETL isn’t what it used to be Session

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.

Ashivni Shekhawat is a data scientist at Lyft working on pricing. Ashivni has developed several algorithms for dynamic pricing, online learning, and estimation at Lyft. Ashivni is deeply interested in statistical inference, experimentation, and machine learning. Ashivni comes from a physics and engineering background and has conducted research and graduate studies at UC Berkeley, Cornell, and IIT Kanpur. He holds degrees in aerospace engineering, physics, and applied mechanics.

Presentations

Approaching the pricing problem at Lyft Session

Ashivni Shekhawat explains how Lyft uses a mix of online learning, optimization, and control theory to operate its ride-sharing marketplace at an efficient price point.

Min Shen is an engineer on LinkedIn’s Hadoop infrastructure development team helping to build next-generation Hadoop infrastructure at LinkedIn with better performance and manageability. Min holds a PhD in computer science from the University of Illinois, where he focused on distributed computing.

Presentations

Metrics-driven tuning of Apache Spark at scale Session

Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Jennie Shin is a healthcare IT leader focused on aligning business strategy with leadership, organizational, technology, and culture needs to evolve the data management and analytics landscape. She has experience designing and orchestrating solutions and processes that increase efficiency, access to data, and organizational transparency.

Jennie provides a unique perspective by combining experiences and best practices from multiple industries to drive change, adoption, and innovation within healthcare.

Presentations

Building a flu predictor model for improved patient care Data Case Studies

As healthcare data becomes increasingly digitized, medical centers are able to leverage data in new ways to improve patient care. Jennie Shin explains how Kaiser Permanente developed a sophisticated flu predictor model to better determine where resources were needed and how to reduce outbreaks.

Tomer Shiran is the CEO and cofounder of Dremio. Previously, he was vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development, and as a member of the executive team, helped grow the company from 5 to 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. Tomer is the founder of the open source Apache Drill project. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, the Israel Institute of Technology. He has authored five US patents.

Presentations

Data reflections: Making data fast and easy to use without making copies Session

Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies.

Tomas Singliar is a data scientist in Microsoft’s AI and Research Group. Tomas’s favorite hammer is probabilistic and Bayesian modeling, which he applies analytically and predictively to business data. He has published a dozen papers in, and serves as a reviewer for, several top-tier AI conferences, including AAAI and UAI, and holds four patents in intent recognition through inverse reinforcement learning. Tomas studied machine learning at the University of Pittsburgh.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Ram Shankar is a security data wrangler in Azure Security Data Science, where he works at the intersection of ML and security. Ram’s work at Microsoft includes a slew of patents in the large intrusion detection space (called “fundamental and groundbreaking” by evaluators). In addition, he has given talks at internal conferences and received Microsoft’s Engineering Excellence award. Ram has previously spoken at data-analytics-focused conferences like Strata San Jose and the Practice of Machine Learning as well as at security-focused conferences like BlueHat, DerbyCon, FireEye Security Summit (MIRCon), and Infiltrate. Ram graduated from Carnegie Mellon University with master’s degrees in both ECE and innovation management.

Presentations

Failed experiments in infrastructure security analytics and lessons learned from fixing them Session

How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process.

Crystal Skelton is an associate in Kelley Drye & Warren’s Los Angeles office, where she represents a wide array of clients from tech startups to established companies in privacy and data security, advertising and marketing, and consumer protection matters. Crystal advises clients on privacy, data security, and other consumer protection matters, specifically focusing on issues involving children’s privacy, mobile apps, data breach notification, and other emerging technologies and counsels clients on conducting practices in compliance with the FTC Act, the Children’s Online Privacy Protection Act (COPPA), the Gramm-Leach-Bliley Act, the GLB Safeguards Rule, Fair Credit Reporting Act (FCRA), the Fair and Accurate Credit Transactions Act (FACTA), and state privacy and information security laws. She regularly drafts privacy policies and terms of use for websites, mobile applications, and other connected devices.

Crystal also helps advertisers and manufacturers balance legal risks and business objectives to minimize the potential for regulator, competitor, or consumer challenge while still executing a successful campaign. Her advertising and marketing experience includes counseling clients on issues involved in environmental marketing, marketing to children, online behavioral advertising (OBA), commercial email messages, endorsements and testimonials, food marketing, and alcoholic beverage advertising. She represents clients in advertising substantiation proceedings and other matters before the Federal Trade Commission (FTC), the US Food and Drug Administration (FDA), and the Alcohol and Tobacco Tax and Trade Bureau (TTB) as well as in advertiser or competitor challenges before the National Advertising Division (NAD) of the Council of Better Business Bureaus. In addition, she assists clients in complying with accessibility standards and regulations implementing the Americans with Disabilities Act (ADA), including counseling companies on website accessibility and advertising and technical compliance issues for commercial and residential products. Prior to joining Kelley Drye, Crystal practiced privacy, advertising, and transactional law at a highly regarded firm in Washington, DC, and as a law clerk at a well-respected complex commercial and environmental litigation law firm in Los Angeles, CA. Previously, she worked at the law firm featured in the movie Erin Brockovich, where she worked directly with Erin Brockovich and the firm’s name partner to review potential new cases.

Presentations

Executive Briefing: Legal best practices for making data work Session

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”

A data scientist and entrepreneur focused on building intelligent systems to collect information and enable better decisions, Peter Skomoroch is currently cofounder and CEO of SkipFlag (recently acquired by Workday). Pete specializes in solving hard algorithmic problems, leading cross-functional teams, and developing engaging products powered by data and machine learning. Previously, he applied his skills to the consumer internet space at LinkedIn, the world’s largest professional network, where he was an early member of the data science team. As principal data scientist, he led data science teams focused on reputation, search, inferred identity, and building data products. He was also the creator of LinkedIn Skills and LinkedIn Endorsements.

Presentations

Who has better taste, machines or humans? Media and Ad Tech

Algorithms decide what we see, what we listen to, what news we consume, and myriad other decisions each day. But while they can make many things more efficient, can they outperform humans in areas where the "right" outcome can't be clearly defined? In this Oxford-style debate, two teams will face off, arguing whether or not machines have better taste than humans.

Jeff Smits is vice president of IT and business services at RingCentral. Jeff has been building enterprise solutions since he started his Silicon Valley career at HP and has held vice president of IT roles at companies including Salesforce, McAfee, and DocuSign. He is passionate about operational excellence and empowering users through cloud computing. Having run big data, BI development, and data warehouses at several companies, Jeff brings an appreciation for how companies harvest the value of their data assets. He is focused on driving value quickly and making it easy for the business to scale. Jeff takes a special interest in enabling decision makers and connecting systems.

Presentations

Harnessing the cloud to enable connected systems and self-service and accelerate business growth (sponsored by Talend) Session

Jeff Smits explains how RingCentral is utilizing the cloud, data integration, self-service, and APIs to harvest the immense potential of connected systems.

Alex Smola is director of machine learning at Amazon.

Presentations

Data science in the cloud Keynote

Alex Smola discusses lessons learned from AWS SageMaker, an integrated framework for handling all stages of analysis. AWS uses open source components such as Jupyter, Docker containers, Python, and well-established deep learning frameworks such as Apache MXNet and TensorFlow for an easy-to-learn workflow.

Presentations

Leveraging live data to realize the smart cities vision Session

One of the key application domains leveraging live data is smart cities, but success depends on the availability of generic platforms that support high throughput and ultralow latency. Arun Kejariwal and Francois Orsini offer an overview of Satori's live data platform and walk you through a country-scale case study of its implementation.

Suqiang Song is director and chapter leader at Mastercard, where Suqiang directly oversees a team embedded within the data engineering and AI tribe. Suqiang blends deep business and technical expertise with a passion for coaching people, helping them grow and develop in their area of expertise and ensuring alignment on the “how” of the work they perform in squads.

Presentations

Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale Session

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach.

Evan Sparks is cofounder and CEO of Determined AI, a software company that makes machine learning engineers and data scientists fantastically more productive. Previously, Evan worked in quantitative finance and web intelligence. He holds a PhD in computer science from UC Berkeley, where, as a member of the AMPLab, he contributed to the design and implementation of much of the large-scale machine learning ecosystem around Apache Spark, including MLlib and KeystoneML. He also holds an AB in computer science from Dartmouth College.

Presentations

Taming deep learning Session

Deep learning has shown tremendous improvements in a number of areas and has justifiably generated enormous excitement. However, several key challenges—from prohibitive hardware requirements to immature software offerings—are impeding widespread enterprise adoption. Evan Sparks details fundamental challenges facing organizations looking to adopt deep learning and shares possible solutions.

Ram Sriharsha is the product manager for Apache Spark at Databricks and an Apache Spark committer and PMC member. Previously, Ram was architect of Spark and data science at Hortonworks and principal research scientist at Yahoo Labs, where he worked on scalable machine learning and data science. He holds a PhD in theoretical physics from the University of Maryland and a BTech in electronics from the Indian Institute of Technology, Madras.

Presentations

Magellan: Scalable and fast geospatial analytics Session

How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity.

Seth Stephens-Davidowitz uses data from the internet (particularly Google searches) to get new insights into the human psyche, measuring racism, self-induced abortion, depression, child abuse, hateful mobs, the science of humor, sexual preference, anxiety, son preference, and sexual insecurity, among many other topics. His 2017 book, Everybody Lies, published by HarperCollins, was a New York Times best seller. Seth is also a contributing op-ed writer for the New York Times. Previously, he was a data scientist at Google and a visiting lecturer at the Wharton School at the University of Pennsylvania. He holds a BA in philosophy (Phi Beta Kappa) from Stanford and a PhD in economics from Harvard. In high school, Seth wrote obituaries for his local newspaper, the Bergen Record, and was a juggler in theatrical shows. He now lives in Brooklyn and is a passionate fan of the Mets, Knicks, and Jets, Stanford football, and Leonard Cohen.

Presentations

Lessons in Google Search data Keynote

Seth Stephens-Davidowitz explains how to use Google searches to uncover behaviors or attitudes that may be hidden from traditional surveys, such as racism, sexuality, child abuse, and abortion.

Maryna Strelchuk is an information architect and application developer at ING. She has a background in software development and artificial intelligence. Currently, she is involved in the Open Metadata initiative, including Apache Atlas.

Presentations

The rise of big data governance: Insight on this emerging trend from active open source initiatives Session

John Mertic and Maryna Strelchuk detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share how companies like ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.

Kapil Surlaker leads the data and analytics team at LinkedIn, where he’s responsible for core analytics infrastructure platforms, including Hadoop, Spark, and computation frameworks like Gobblin; Pinot, an OLAP serving store; and XLNT, LinkedIn’s experimentation platform. Previously, Kapil led the development of Databus, a database change capture platform that forms the backbone of LinkedIn’s online data ecosystem; Espresso, a distributed document store that powers many applications on the site; and Helix, a generic cluster management framework that manages multiple infrastructure deployments at LinkedIn. Prior to LinkedIn, Kapil held senior technical leadership positions at Kickfire (acquired by Teradata) and Oracle. Kapil holds a BTech in computer science from IIT Bombay and an MS from the University of Minnesota.

Presentations

If you can’t measure it, you can’t improve it: How reporting and experimentation fuel product innovation at LinkedIn Session

Metrics measurement and experimentation play crucial roles in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.

Ian Swanson is the CEO of DataScience.com. An expert in big data and analytics, an accomplished entrepreneur, and a successful executive for such Fortune 500 companies as American Express and Sprint, Ian is at home in both startups and enterprise-level organizations. Previously, he founded Sometrics (acquired by American Express in 2011), which launched the industry’s first global virtual currency platform. That platform—for which he holds a patent—managed more than 3.3 trillion units of virtual currency and served an online audience of 250 million in more than 180 countries. Prior to Sometrics, Ian worked for the secure chat and messaging startup Userplane (acquired by AOL). A sought-after speaker on data science, the internet of things, big data, and performance-based analytics, he advises a number of companies on their product and marketing strategies and serves as a mentor to the Los Angeles startup incubators Amplify and Launchpad LA. Ian won the 2013 American Express Chairman’s Award and was twice recognized as one of Direct Marketing News’s 30 under 30. He attended the University of California, Santa Barbara.

Presentations

Digital transformation demands faster, more productive data science (sponsored by DataScience.com) Session

Ian Swanson shares strategies for leading more productive data science teams, along with steps you can take today to meet growing demands for AI and machine learning use cases.

Pawel Szostek is a senior software engineer on Criteo’s analytics data storage team, where he works on various projects, including implementing an improved HyperLogLog algorithm. Previously, he was a researcher at CERN in Geneva.

Presentations

Hive as a service Session

Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.

Ran Taig is a senior data scientist at Dell EMC, where he leads data science projects, especially in the domain of hardware failure prediction, and plays a key role in designing the team engagement models and work structure, serving as a consultant to EMC’s business data lake team. Ran is also responsible for the team’s academic relations and continues to teach theory courses for CS students. Previously, Ran taught the Design of Algorithms and other CS theory courses at Ben-Gurion University. He holds a PhD in computer science from Ben-Gurion University, Israel, where he specialized in artificial intelligence. His research mainly focused on automated planning.

Presentations

AI-powered crime prediction Session

What if we could predict when and where crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Spark NLP in action: Improving patient flow forecasting at Kaiser Permanente Session

David Talby and Santosh Kulkarni explain how Kaiser Permanente uses the open source NLP library for Apache Spark to tackle one of the most common challenges with applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline.

Yulia Tell is a technical program manager on the big data technologies team within the Software and Services Group at Intel, where she is working on several open source projects and partner engagements in the big data domain. Yulia’s work is focused specifically on Apache Hadoop and Apache Spark, including big data analytics applications that use machine learning and deep learning. Yulia holds an MSc in computer science from Moscow Power Engineering Technical University and has completed executive training on market driving strategies at London Business School.

Presentations

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.

Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today Daniel works on the YARN development team at Cloudera, focused on the resource manager, fair scheduler, and Docker support.

Presentations

What's new in Hadoop 3.0 Session

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Siddharth Teotia is a software engineer at Dremio and a PMC member for the Apache Arrow project. Previously, Siddharth was a member of the database kernel team at Oracle, where he worked on storage, indexing, and the in-memory columnar query processing layers of Oracle RDBMS. He holds an MS in software engineering from CMU and a BS in information systems from BITS Pilani, India. During his studies, Siddharth focused on distributed systems, databases, and software architecture.

Presentations

Vectorized query processing using Apache Arrow Session

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.

Tobias Ternstrom is the product management leader for open source databases and database migrations in Microsoft Azure, where he is responsible for supporting Microsoft’s customers betting on open source databases for their business. His team builds Azure database services for MariaDB, MySQL, and PostgreSQL as well as the Azure Database Migration Service. Since joining Microsoft, he has made critical contributions to the Microsoft database business focused on the SQL Server database engine and the database service Azure SQL Database, based on SQL Server. Previously, Tobias ran product management for bringing Microsoft SQL Server to Linux. Prior to Microsoft, he was a product manager at MemSQL and founded a few startups in the areas of software development consulting services for enterprises, SaaS services for personality testing and personnel development, and point-of-sale systems. He lives in Redmond with his wife and two children.

Presentations

Focus on your business: Case studies on building data solutions that meet your needs (sponsored by Microsoft) Session

Tobias Ternstrom leads a deep dive into case studies from three Microsoft customers who put technology before solutions. Tobias examines the decisions that brought them there and outlines how they got back on track and solved their business problems.

To a hammer, everything is a nail: Choosing the right tool for your business problems (sponsored by Microsoft) Keynote

The emergence of the cloud combined with open source software ushered in an explosive use of a broad range of technologies. Tobias Ternstrom explains why you should step back and attempt to objectively evaluate the problem you are trying to solve before choosing the tool to fix it.

Anjali Thakur is part of Accenture’s Applied Intelligence practice, where she collaborates with Accenture’s AI ecosystem of partners, startups, and a network of internal and external organizations to bring AI transformation to Accenture’s clients globally.

Presentations

Executive Briefing: The rise of the ecosystem Session

Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Anjali Thakur shares examples of teaming models and leading practices for accelerating value from your ecosystem strategy.

Meena Thandavarayan is a practice lead at Infosys, where he focuses on leveraging technical advancements and industry reference architectures for defining a data delivery platform. Meena has extensive experience leading application, technology, data, and infrastructure teams developing strategy, architecture, implementation, and IT operational services. A big data and analytics evangelist, he specializes in strategy for accelerating the digitization journey for oil and gas clients: most recently, he delivered functional and technical architecture for a one-stop self-service data and information portal.

Presentations

Meta your data; drain the big data swamp Data Case Studies

Madhav Madaboosi and Meenakshisundaram Thandavarayan offer an overview of BP's self-service operational data lake, which improved operational efficiency, boosting productivity through fully identifiable data and reducing risk of a data swamp. They cover the path and big data technologies that BP chose, lessons learned, and pitfalls encountered along the way.

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Wee Hyong Tok is a principal data science manager with the AI CTO office at Microsoft, where he leads the engineering and data science team for the AI for Earth program. Wee Hyong has worn many hats in his career, including developer, program and product manager, data scientist, researcher, and strategist, and his track record of leading successful engineering and data science teams has given him unique superpowers to be a trusted AI advisor to customers. Wee Hyong coauthored several books on artificial intelligence, including Predictive Analytics Using Azure Machine Learning and Doing Data Science with SQL Server. Wee Hyong holds a PhD in computer science from the National University of Singapore.

Presentations

How does a big data professional get started with AI? Session

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI.

Carlo Torniai is head of data science and analytics at Pirelli. An accomplished data scientist with experience ranging across various areas of computer science and information technology, Carlo has extensive experience in data modeling, data analysis, and data engineering, and Python in the data science space (e.g., pandas, scipy, scikit-learn). Previously, he was a staff data scientist at Tesla Motors. He holds a PhD in informatics from the Università degli Studi di Firenze, Italy.

Presentations

Pirelli Connesso: Where the road meets the cloud Session

Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams.

Ayin Vala is the founder of DeepMD and cofounder and chief data scientist at the nonprofit organization Foundation for Precision Medicine, where he and his research and development team work on statistical analysis and machine learning, pharmacogenetics, molecular medicine, and sciences relevant to the advancement of medicine and healthcare delivery. Ayin has won several awards and patents in the healthcare, aerospace, energy, and education sectors. Ayin holds master’s degrees in information management systems from Harvard University and mechanical engineering from Georgia Tech.

Presentations

Reinventing healthcare: Early detection of Alzheimer’s disease with deep learning Session

Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient.

William Vambenepe leads the product management team responsible for big data services on Google Cloud Platform.

Presentations

What separates the clouds? (sponsored by Google Cloud) Keynote

William Vambenepe explains how a pivot toward machine learning and artificial intelligence has created clearer separation among clouds than ever before. William walks you through an interesting use case of machine learning in action and discusses the central role AI will play in big data analysis moving forward.

Presentations

Fighting sex trafficking with data science Session

Sugreev Chawla offers an overview of Spotlight, a tool created by Thorn, a nonprofit that uses technology to fight online child sexual exploitation. It allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking, using graph analysis, time series analysis, and NLP techniques to surface important networks of ads and characterize their behavior over time.

Vinithra Varadharajan is a senior engineering manager in the cloud organization at Cloudera, where she is responsible for the cloud portfolio products, including Altus Data Engineering, Altus Analytic Database, Altus SDX, and Cloudera Director. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Emre Velipasaoglu is principal data scientist at Lightbend. A machine learning expert, Emre previously served as principal scientist and senior manager at Yahoo! Labs. He has authored 23 peer-reviewed publications and nine patents in search, machine learning, and data mining. Emre holds a PhD in electrical and computer engineering from Purdue University and completed postdoctoral training at Baylor College of Medicine.

Presentations

Machine-learned model quality monitoring in fast data and streaming applications Session

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.

Aishwarya Venkataraman is a software engineer on the Cloudera Altus team.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Shivaram Venkataraman is a postdoctoral researcher at Microsoft Research. Starting in Fall 2018, he will be an assistant professor in computer science at the University of Wisconsin-Madison. Shivaram holds a PhD from the University of California, Berkeley, where he was advised by Mike Franklin and Ion Stoica. His work spans distributed systems, operating systems, and machine learning, and his recent research has looked at designing systems and algorithms for large-scale data analysis.

Presentations

Accelerating deep learning on Apache Spark using BigDL with coarse-grained scheduling Session

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

Josef Viehhauser is a full stack data scientist at the BMW Group, where he leverages machine learning to create data-driven applications and improve established workflows along the company’s value chain. Josef also works on scoping and implementing such use cases in scalable ecosystems primarily via Python. Outside of work, he is interested in technological innovations and soccer.

Presentations

Data-driven ecosystems in the automotive industry Session

The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside of the organization. Josef Viehhauser and Tobias Bürger discuss the E-to-E relationship of data and models and share best practices for scaling applications in real-world environments.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Ask Me Anything: Streaming architectures and applications (Kafka, Spark, Akka, and microservices) Session

Join Dean Wampler and Boris Lublinsky to discuss all things streaming, from architecture and implementation to streaming engines and frameworks. Be sure to bring your questions about techniques for serving machine learning models in production, traditional big data systems, or software architecture in general.

Kafka streaming applications with Akka Streams and Kafka Streams Session

Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams, microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.

Meet the Expert with Dean Wampler (Lightbend) Meet the Experts

Join Dean to discuss all things streaming, from new architectures that integrate more closely with microservices, using Kafka as the data backplane, and Spark and Flink for large-scale data streams where sophisticated handling is required to using Akka Streams and Kafka Streams, runtime platforms like Mesos, Kubernetes, and YARN, and alternatives to all the above.

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams Tutorial

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Andrew Wang is a software engineer on the HDFS team at Cloudera. Previously, he was a graduate student in the AMPLab at the University of California, Berkeley, advised by Ion Stoica, where he worked on research related to in-memory caching and quality of service. In his spare time, he enjoys going on bike rides, cooking, and playing guitar.

Presentations

What's new in Hadoop 3.0 Session

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Jiao (Jennie) Wang is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.

Presentations

Playing well together: Big data beyond the JVM with Spark and friends Session

Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).

Jennifer Webb is vice president of development and operations at SuprFanz. Jennifer has over 10 years of experience as a website and application developer for large and small companies, including major banks, and as a keyboardist in rock bands in Toronto, Calgary, and Vancouver.

Presentations

Data science in practice: Examining events in social media Media and Ad Tech

Jennifer Webb explains how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.

Brooke Wenig is an instructor and data science consultant for Databricks. Previously, she was a teaching associate at UCLA, where she taught graduate machine learning, senior software engineering, and introductory programming courses. Brooke also worked at Splunk and Under Armour as a KPCB fellow. She holds an MS in computer science with highest honors from UCLA with a focus on distributed machine learning. Brooke speaks Mandarin Chinese fluently and enjoys cycling.

Presentations

Apache Spark programming 2-Day Training

Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.

Apache Spark programming (Day 2) Training Day 2

Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.

Josh Wills is a software engineer on Slack’s search, learning, and intelligence team. Previously, Josh built data teams, products, and infrastructure at Google and Cloudera. He is the founder and vice president of the Apache Crunch project for creating optimized MapReduce pipelines in Java and lead developer of Cloudera ML, a set of open source libraries and command-line tools for building machine learning models on Hadoop. Josh is a coauthor of Advanced Analytics with Spark. He is also known for his pithy definition of a data scientist as “someone who is better at software engineering than any statistician and better at statistics than any software engineer.”

Presentations

Data science at Slack Session

Josh Wills describes recent data science and machine learning projects at Slack.

Weisheng Vincent Xie (谢巍盛) is the technical director at Orange Finance, a FinTech company where he helps build big data services for analytics, streaming, and machine learning across various applications. Previously, he was a senior software engineer and machine learning scientist at Intel working on machine learning- and big data-related technologies and the tech lead of Intel’s machine learning engineering team. Vincent is an active contributor to open source communities. He has rich experience in applied machine learning technologies in the fields of healthcare, fintech, and ecommerce.

Presentations

Spark ML optimization at Intel: A case study Session

Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel's work on Spark ML optimization.

Jerry Xu is cofounder and CTO at Datatron Technologies. An innovative software engineer with extensive programming and design experience in storage systems, online services, mobile, distributed systems, virtualization, and OS kernels, Jerry also has a demonstrated ability to direct and motivate a team of software engineers to complete projects meeting specifications and deadlines. Previously, he worked at Zynga, Twitter, Box, and Lyft, where he built the company’s ETA machine learning model. Jerry is the author of the open source project LibCrunch. He is a three-time Microsoft Gold Star Award winner.

Presentations

Lessons learned deploying machine learning and deep learning models in production at major tech companies Session

Deploying machine learning and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how the production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory.

Ya Xu is principal staff engineer and statistician at LinkedIn, where she leads a team of engineers and data scientists building a world-class online A/B testing platform. She also spearheads taking LinkedIn’s A/B testing culture to the next level by evangelizing best practices and pushing for broad-based platform adoption. She holds a PhD in statistics from Stanford University.

Presentations

If you can’t measure it, you can’t improve it: How reporting and experimentation fuel product innovation at LinkedIn Session

Metrics measurement and experimentation play crucial roles in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.

Yu Xu is the founder and CEO of TigerGraph, the world’s first native parallel graph database. He is an expert in big data and parallel database systems and has over 26 patents in parallel data management and optimization. Previously, Yu worked on Twitter’s data infrastructure for massive data analytics and was Teradata’s Hadoop architect leading the company’s big data initiatives. Yu holds a PhD in computer science and engineering from the University of California, San Diego.

Presentations

Real-time deep link analytics: The next stage of graph analytics Session

Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.

Fabian Yamaguchi is the chief scientist at ShiftLeft. Fabian has over 10 years of experience in the security domain, where he has worked as a security consultant and researcher focusing on manual and automated vulnerability discovery. He has identified previously unknown vulnerabilities in popular system components and applications such as the Microsoft Windows kernel, the Linux kernel, the Squid proxy server, and the VLC media player. Fabian is a frequent speaker at major industry conferences such as Black Hat USA, DEF CON, First, and CCC and renowned academic security conferences such as ACSAC, Security and Privacy, and CCS. He holds a master’s degree in computer engineering from Technical University Berlin and a PhD in computer science from the University of Goettingen.

Presentations

Code Property Graph: A modern, queryable data storage for source code Session

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while remaining rapidly accessible.

Shenghu Yang is an engineering manager at Lyft, where he was a founding member of the company’s data platform team and now runs the data tools team. Previously, Shenghu worked at Oracle and @WalmartLabs on cloud computing and digital marketing-related engineering work. He holds an MS from Carnegie Mellon University.

Presentations

Lyft's analytics pipeline: From Redshift to Apache Hive and Presto Session

Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency.

Chuck Yarbrough is vice president of solutions marketing and management at leading IoT and big data analytics company Hitachi Vantara, where he is responsible for creating and driving repeatable solutions that leverage Hitachi Vantara’s Pentaho platform, enabling customers to implement IoT and big data solutions that transform companies into data-driven enterprises. Chuck has more than 20 years of experience helping organizations run, manage, and transform their business through better use of data. Previously, Chuck held leadership roles at Deloitte, SAP, and Hyperion.

Presentations

Managing the intelligent data pipeline and the connected enterprise (sponsored by Hitachi Vantara) Session

Intelligently managing the data pipeline is the key to driving business acceleration and reducing costs. Chuck Yarbrough outlines ways to gain control over the data pipeline. Along the way, you’ll learn how cloud, big data, and machine learning models intersect and how streaming and cloud integration can help create the connected enterprise.

Yi Yin is a software engineer on the data engineering team at Pinterest, where he works on Kafka-to-S3 persisting tools and schema generation of Pinterest’s data.

Presentations

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously Session

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

Tendü Yoğurtçu is the chief technology officer at Syncsort, where she directs the company’s technology strategy and innovation and leads all product research and development programs. Tendü has 20+ years of software industry experience, including extensive big data and Hadoop industry knowledge. Previously, Tendü was Syncsort’s general manager of big data, where she led the global software business for data integration, Hadoop, and the cloud, including sales, marketing, engineering, and support, and held several engineering management roles where she directed the development of ETL, sort, and application modernization products for Syncsort’s data integration business. She was also an adjunct faculty member in the Computer Science Department at Stevens Institute of Technology. Tendü is a dedicated advocate for STEM education for women and diversity. She holds a PhD in computer science from Stevens Institute of Technology in NJ, a master’s degree in industrial engineering, and a bachelor’s degree in computer engineering from Bosphorus University, Istanbul.

Presentations

Get a farm-to-table view of your data: Track data lineage from source to analytics (sponsored by Syncsort) Session

Chefs must be able to trust the authenticity, quality, and origin of their ingredients; data analysts must be able to do the same of their data—and what happens to it along the way. Tendü Yoğurtçu explains how to seamlessly track the lineage and quality of your data—on and off the cluster, on-premises or in the cloud—to deliver meaningful insights and meet regulatory compliance requirements.

Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.

Presentations

How to use Impala's query plan and profile to fix performance issues Tutorial

Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses and how Impala optimizes queries, and explains how to identify performance bottlenecks through the query plan and profile and how to drive Impala to its full potential.

Ali Zaidi is a data scientist in Microsoft’s AI and Research Group, where he spends his day trying to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Previously, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Ye Zhou is a software engineer on LinkedIn’s Hadoop infrastructure development team, mostly focusing on Hadoop YARN and Spark-related projects. Ye holds a master’s degree in computer science from Carnegie Mellon University.

Presentations

Metrics-driven tuning of Apache Spark at scale Session

Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.