Presented by O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.

Vivek Agate is a staff software engineer at FireEye with 8+ years of experience in software design and development in various Java technologies.

Presentations

FireEye's journey migrating 25 TB of RDBMS data to Hadoop Session

Ganesh Prabhu and Vivek Agate present an approach that enabled a small team at FireEye to migrate 25 TB of RDBMS data, comprising over 250 tables and nearly 2,000 partitions, to Hadoop and describe an adaptive platform that allows migration of a rapidly changing dataset to Hive. Along the way, they share some of the challenges typical for a company embarking on a Hadoop implementation.

John Mark Agosta is a principal data scientist in IMML at Microsoft. Over his career, he has worked with startups and labs in the Bay Area, including the original Knowledge Industries, and was a researcher at Intel Labs, where he was awarded a Santa Fe Institute Business Fellowship in 2007, and at SRI International after receiving his PhD from Stanford. He has participated in the annual Uncertainty in AI conference since its inception in 1985, a testament to his dedication to probability and its applications. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Ashvin Agrawal is a senior research engineer at Microsoft, where he works on streaming systems and contributes to the Twitter Heron project. Ashvin is a software engineer with more than 10 years of experience. He specializes in developing large-scale distributed systems. Previously, he worked at VMware, Yahoo, and Mojo Networks. Ashvin holds an MTech in computer science from IIT Kanpur, India.

Presentations

From rivulets to rivers: Elastic stream processing in Heron Session

Using Heron, an open source streaming engine tailored for large-scale environments, Twitter processes billions of events per day the instant the data is generated. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.

Shekhar Agrawal is the director of data science at Comcast. Shekhar is an expert data scientist specializing in text analytics and NLP. He currently leads several PB-scale modeling initiatives to improve the customer experience.

Presentations

Real-time analytics using Kudu at petabyte scale Session

Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.

Tyler Akidau is a staff software engineer at Google. The current tech lead for internal streaming data processing systems (e.g., MillWheel), Tyler has spent five years working on massive-scale streaming data processing systems. He passionately believes in streaming data processing as the more general model of large-scale computation. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Learn stream processing with Apache Beam Tutorial

Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau, Frances Perry, and Jesse Anderson cover the basics of robust stream processing with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow.

The evolution of massive-scale data processing Session

Join Tyler Akidau for a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, as Tyler compares and contrasts systems at Google with popular open source systems in use today.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Sridhar Alla is the director of big data solutions and architecture at Comcast, where he has delivered several key solutions, such as the Xfinity personalization platform, clickthru analytics, and the correlation platform. Sridhar started his career working on network appliances, NAS, and caching technologies. Previously, he served as the CTO of security company eIQNetworks, where he merged the concepts of big data and security products. He holds patents on very large-scale processing algorithms and caching.

Presentations

Real-time analytics using Kudu at petabyte scale Session

Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.

Anima Anandkumar is a principal scientist at Amazon Web Services. Anima is currently on leave from UC Irvine, where she is an associate professor. Her research interests are in the areas of large-scale machine learning, nonconvex optimization, and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. Previously, she was a postdoctoral researcher at MIT and a visiting researcher at Microsoft Research New England. Anima is the recipient of several awards, including the Alfred P. Sloan Fellowship, the Microsoft Faculty Fellowship, the Google Research Award, the ARO and AFOSR Young Investigator Awards, the NSF CAREER Award, the Early Career Excellence in Research Award at UCI, the Best Thesis Award from the ACM SIGMETRICS Society, the IBM Fran Allen PhD Fellowship, and several best paper awards. She has been featured in a number of forums, such as the Quora ML session, Huffington Post, Forbes, and O’Reilly Media. Anima holds a BTech in electrical engineering from IIT Madras and a PhD from Cornell University.

Presentations

Distributed deep learning on AWS using MXNet Session

Anima Anandkumar demonstrates how to use preconfigured Deep Learning AMIs and CloudFormation templates on AWS to help speed up deep learning development and shares use cases in computer vision and natural language processing.

Jesse Anderson is a data engineer, creative engineer, and CEO of Smoking Hand. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students, at organizations ranging from startups to Fortune 100 companies, the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in prestigious publications such as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at JesseAnderson.com.

Presentations

Learn stream processing with Apache Beam Tutorial

Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau, Frances Perry, and Jesse Anderson cover the basics of robust stream processing with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow.

Real-time data engineering in the cloud 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks—both open source and managed cloud services—discusses the leading cloud providers, and explains how to choose the right one for your company.

Real-time data engineering in the cloud (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks—both open source and managed cloud services—discusses the leading cloud providers, and explains how to choose the right one for your company.

Michael Armbrust is the lead developer of the Spark SQL project at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael has a PhD from UC Berkeley. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.

Presentations

How Spark can fail or be confusing and what you can do about it Session

Just like any six-year-old, Apache Spark does not always do its job and can be hard to understand. Michael Armbrust and Eric Liang look at the top causes of job failures customers encountered in production and examine ways to mitigate such problems by modifying Spark. They also share a methodology for improving resilience: a combination of monitoring and debugging techniques for users.

Making Structured Streaming ready for production: Updates and future directions Session

Apache Spark 2.0 introduced the core APIs of Structured Streaming, a new stream processing engine built on Spark SQL. Since then, development efforts have focused on making the engine ready for production use. Michael Armbrust and Tathagata Das discuss the major features that have been added, recipes for using them in production, and the exciting new features planned for future releases.

Shivnath Babu is an associate professor of computer science at Duke University, where his research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. He is also the chief scientist at Unravel Data Systems, the company he cofounded to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has received a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award. He has given talks and distinguished lectures at many research conferences and universities worldwide. Shivnath has also spoken at industry conferences, such as the Hadoop Summit.

Presentations

Deep learning for IT operations intelligence using open source tools Session

Shivnath Babu offers an introduction to using deep learning to solve complex problems in IT operations analytics. Shivnath focuses on how deep learning can derive operations insights automatically for the complex big data application stack composed of systems such as Hadoop, Spark, Cassandra, Elasticsearch, and Impala, using examples of open source tools for deep learning.

Vishal Bamba is vice president of strategy and architecture at Transamerica Technology, where he leads a team focusing on innovation initiatives within the enterprise. Vishal has over 15 years of experience in distributed systems and has led many innovation projects. He has consulted and worked for several companies including Disney, Getty, Northrop, and AIG/SunAmerica. Vishal holds an MS in computer science from the University of Southern California.

Presentations

Transamerica's journey to Customer 360 and beyond Session

Vishal Bamba and Rocky Tiwari offer an overview of Transamerica's Customer 360 platform and the work done afterward to utilize this technology, including graph databases and machine learning, to help create targeted segments for products and campaigns.

Dorna Bandari is a data scientist at Pinterest, where she specializes in developing new machine-learning models in a broad range of product areas, from concept creation to productionization.

Presentations

Clustering user sessions with NLP methods in complex internet applications Session

Most internet companies record a constant stream of logs as a user interacts with their application. Depending on the complexity of the application, the logs can be extremely difficult to decipher. Dorna Bandari presents a novel NLP-based method for clustering user sessions in consumer internet applications, which has proved to be extremely effective in both driving strategy and personalization.

Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Before joining Amazon, Roger was in the Cloud Machine Learning group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.

Presentations

Amazon Kinesis data streaming services Session

Roger Barga offers an overview of Kinesis, Amazon’s data streaming platform, which includes Kinesis Firehose, Kinesis Analytics, and Kinesis Streams, and explains how customers have architected their applications using Kinesis services for low-latency and extreme scale.

Alon Bartur brings a wealth of field experience in product management, alliances, and sales engineering to Trifacta, where, as director of product management, he works closely with customers and partners to drive the product roadmap and requirements. Prior to joining Trifacta, Alon worked at GoodData and Google.

Presentations

Beyond polarization: Data UX for a diversity of workers Session

Joe Hellerstein, Giorgio Caviglia, and Alon Bartur share their design philosophy for users and their experience designing UIs, illustrating their design principles with core elements from Trifacta, including the founding technology of predictive interaction, recent innovations like transform builder, and other developments in their core transformation experience.

Ryan Baumann is a sales and solutions engineer at Mapbox, where he integrates engineering and sales skills to help cities use open data to make better decisions and show how Mapbox tools can be used for spatial analysis and visualization. Ryan has experience building end-to-end solutions for mining and construction customers to improve safety, productivity, and availability. Previously, he worked as a solutions engineer at Caterpillar, where he built applications to make mining operations more efficient and environmentally friendly. As a former pro cyclist, Ryan loves all things outdoors. You’ll find him scouting out Marin County on his mountain bike or hiking Mt. Diablo on the weekends. Ryan has a bachelor’s degree in mechanical engineering from the University of Wisconsin-Madison.

Presentations

Transforming cities with Mapbox and open data Tutorial

Ryan Baumann explains how Mapbox Cities helps transform transportation and safety using open data, spatial analysis, and Mapbox tools.

Gil Benghiat is one of three founders of DataKitchen, a company on a mission to enable analytic teams to deliver value quickly and with high quality. Gil’s career has always been data oriented and has included positions collecting and displaying network data at AT&T Bell Laboratories (now Alcatel-Lucent), managing data at Sybase (purchased by SAP), collecting and cleaning clinical trial data at PhaseForward (IPO then purchased by Oracle), integrating pharmaceutical sales data at LeapFrogRx (purchased by Model N), and liberating data at Solid Oak Consulting. Gil holds an MS in computer science from Stanford University and a BS in applied mathematics and biology from Brown University. He has hiked all 48 of New Hampshire’s 4,000 peaks and is now working on the New England 67.

Presentations

Seven shocking steps to Agile analytic operations Session

Data analysts, data scientists, and data engineers are already working in teams delivering insight and analysis, but how do you get the team to support experimentation and insight delivery without ending up in an IT versus data engineer versus data scientist war? Christopher Bergh and Gil Benghiat present the seven shocking steps to get these groups of people working together.

Christopher Bergh is a founder and head chef at DataKitchen, where, among other activities, he leads DataKitchen’s Agile Data initiative. Chris has more than 25 years of research, engineering, analytics, and executive management experience. Previously, he was regional vice president in the Revenue Management Intelligence group at Model N; was COO of LeapFrogRx, a descriptive and predictive analytics software and service provider (where he led the 2012 acquisition of LeapFrogRx by Model N); was CTO and vice president of product management of MarketSoft (now part of IBM), an innovative enterprise marketing management software vendor; developed Microsoft Passport, the predecessor to Windows Live ID, a distributed authentication system used by hundreds of millions of users today (and was awarded a US patent for his work on that project); led the technical architecture and implementation of Firefly Passport, an early leader in internet personalization and privacy acquired by Microsoft; and led the development of the first travel-related ecommerce website at NetMarket. Chris began his career at the Massachusetts Institute of Technology’s Lincoln Laboratory and NASA’s Ames Research Center, where he created software and algorithms that provided aircraft arrival optimization assistance to air traffic controllers at several major airports in the United States. Chris also served as a Peace Corps volunteer math teacher in Botswana. Chris holds an MS from Columbia University and a BS from the University of Wisconsin-Madison. He is an avid cyclist, hiker, and reader and is the father of two teenagers.

Presentations

Seven shocking steps to Agile analytic operations Session

Data analysts, data scientists, and data engineers are already working in teams delivering insight and analysis, but how do you get the team to support experimentation and insight delivery without ending up in an IT versus data engineer versus data scientist war? Christopher Bergh and Gil Benghiat present the seven shocking steps to get these groups of people working together.

Cesar Berho is a senior security researcher at Intel and a committer to the Apache Spot project. Cesar has 12 years of experience working within the cybersecurity industry in positions in operations, design, engineering, and research. Recently, he has been focusing on new ways to analyze telemetry sources with analytics and benchmarking security implementations.

Presentations

Paint the landscape and secure your data center with Apache Spot Session

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture through unsupervised machine learning at cloud scale for anomaly detection.

Rahul Bhartia is a solutions architect at Amazon Web Services, where he helps AWS technology partners architect big data and analytics applications for the cloud. Prior to joining AWS, Rahul spent a number of years gaining experience in architecting data and information processing systems at companies including PayPal and Cognizant. Rahul holds a bachelor of science in information science.

Presentations

Building your first big data application on AWS Tutorial

Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Rahul Bhartia walks you through building a big data application in real time using a combination of open source technologies, including Apache Hadoop, Spark, and Zeppelin, as well as AWS managed services such as Amazon EMR, Amazon Kinesis, and more.

Joseph Blue is a data scientist at MapR. Previously, Joe developed predictive models in healthcare for Optum (a division of UnitedHealth) as chief scientist and was the first fellow for Optum’s startup, Optum Labs. Before his time at Optum, Joe accumulated 10 years of analytics experience at LexisNexis, HNC Software, and ID Analytics (now LifeLock), specializing in business problems such as fraud and anomaly detection. He is listed on several patents.

Presentations

Applying machine learning to live patient data Session

Joseph Blue walks you through a reference application that processes ECG data encoded in HL7 with a modern anomaly detector, demonstrating how combining visualization and alerting enables healthcare professionals to improve outcomes and reduce costs. Joseph shares the architecture and code as well as lessons learned from his experience dealing with real data in real medical situations.

Joseph Bradley is a developer working on machine learning at Databricks. Joseph is an Apache Spark committer and PMC member. Previously, he was a postdoc at UC Berkeley. Joseph holds a PhD in machine learning from Carnegie Mellon University, where he focused on scalable learning for probabilistic graphical models, examining trade-offs between computation, statistical efficiency, and parallelization.

Presentations

Best practices for deep learning on Apache Spark Session

Joseph Bradley and Tim Hunter share best practices for building deep learning pipelines with Apache Spark, covering cluster setup, data ingest, tuning clusters, and monitoring jobs—all demonstrated using Google’s TensorFlow library.

Matt Brandwein is director of product management at Cloudera, driving the platform’s experience for data scientists and data engineers. Before that, Matt led Cloudera’s product marketing team, with roles spanning product, solution, and partner marketing. Previously, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in computer science and mathematics from the University of Massachusetts Amherst.

Presentations

Making self-service data science a reality Session

Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine-learning and distributed-systems experience. Previously, Claudiu worked for Atigeo Inc., building big data and data science-driven products for various customers.

Presentations

Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies Session

David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.

Marissa Brienza is a data scientist, an operations research analyst, and a forever student of the mathematics and statistics field. At Booz Allen Hamilton, she uses her 3+ years of experience in analysis and visualization strategies to extract knowledge from data to serve a variety of client needs. To Marissa, data science is all about turning data into insightful information that facilitates action. She holds a BS in statistical science from the University of South Carolina and is finishing up her MS in operations research at George Mason University. Her “free time” consists of good food, good friends, and her adorable pet bunny, Oliver.

Presentations

What else does your smart car know about you? Session

Data generated from vehicles can be an incredible yet largely untapped source of insights. Building on a talk from Strata + Hadoop World London last year, Charles Givre and his team present the results of their research in applying machine learning to vehicle telematics data.

Kurt Brown leads the Data Platform team at Netflix. Kurt’s group architects and manages the technical infrastructure underpinning the company’s analytics, which includes various big data technologies (e.g., Hadoop, Spark, and Presto), Netflix open sourced applications and services (e.g., Genie and Lipstick), and traditional BI tools (e.g., Tableau and Teradata).

Presentations

The Netflix data platform: Now and in the future Session

The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.

Charlie Burgoyne is the principal director of data science at frog, where he guides the vision and implementation of the new data science organization and initiatives and helps frog complement its traditional process with rigorous data science. Charlie leads a team of highly trained scientists and engineers from several studios across the world to implement advanced analytics, machine learning, and artificial intelligence into frog products. Previously, Charlie held a variety of roles, including director of data science at Rosetta Stone, vice president of R&D for a government contracting firm specializing in cybersecurity and machine learning, research physicist for the DOE and NNSA, and research astrophysicist for NASA in conjunction with George Washington University. Charlie holds a master’s degree in astrophysics from Georgetown University and a bachelor’s in nuclear physics from George Washington University. He has a real passion for languages and speaks French, German, and Italian.

Presentations

Bringing data into design: How to craft personalized user experiences Session

From personalized newsfeeds to curated playlists, users want tailored experiences when they interact with their devices. Ricky Hennessy and Charlie Burgoyne explain how frog’s interdisciplinary teams of designers, technologists, and data scientists create data-driven, personalized, and adaptive user experiences.

Michelle Casbon is director of data science at Qordoba. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on textual datasets. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. Michelle completed a master’s degree at the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her written work has been featured in the AI section of O’Reilly Radar.

Presentations

Machine learning to automate localization with Apache Spark and other open source tools Session

Supporting multiple locales involves the maintenance and generation of localized strings. Michelle Casbon explains how machine learning and natural language processing are applied to the underserved domain of localization using primarily open source tools, including Scala, Apache Spark, Apache Cassandra, and Apache Kafka.

Giorgio Caviglia is principal UX designer at Trifacta. Previously, Giorgio was part of the Center for Spatial and Textual Analysis at Stanford University, where he designed digital tools to support scholarly research in the digital humanities. Giorgio has been involved in research, consulting, and teaching activities at both public and private institutions, such as Stanford University, Politecnico di Milano, ISIA Urbino, IULM, and Accurat. His research focuses on data-driven visualizations and interfaces for the humanities and social sciences. Giorgio’s work has been featured at numerous international conferences and venues, including SIGGRAPH, MIT, Harvard University, MediaLAB Prado, Expo 2010 Shanghai, and Triennale di Milano, and in publications and showcases, such as Visual Complexity, Malofiej, Data Flow, Design for Information, Fast Company, Gizmodo, Gigaom, and Wired. Giorgio holds an MSc in communication design and a PhD in design from the Politecnico di Milano, where he was part of the DensityDesign lab from its beginning until 2013.

Presentations

Beyond polarization: Data UX for a diversity of workers Session

Joe Hellerstein, Giorgio Caviglia, and Alon Bartur share their design philosophy for users and their experience designing UIs, illustrating their design principles with core elements from Trifacta, including the founding technology of predictive interaction, recent innovations like transform builder, and other developments in their core transformation experience.

Vinoth Chandar works on data infrastructure at Uber, with a focus on Hadoop and Spark. Vinoth has a keen interest in unified architectures for data analytics and processing. Previously, Vinoth was the LinkedIn lead on Voldemort and worked on Oracle server’s replication engine, HPC, and stream processing.

Presentations

Incremental processing on Hadoop at Uber Session

To fulfill its mission, Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Vinoth Chandar explores data processing systems for near-real-time use cases, making the case that adding new incremental processing primitives to existing Hadoop technologies can solve many problems at reduced cost and in a unified manner.

Alan Chaney is chief architect and vice president of engineering at Bitvore Corp. Alan started down the technology path years before he knew anything about software. A guy who’s always been fascinated by how things work, he fed his curiosity by disassembling everything from clocks to motorcycles. When he learned that software offered the same sort of intellectual stimulation (but without the solder burns), he quickly changed his focus from rewiring objects to writing code. During his career as an academic and entrepreneur, Alan dedicated himself to reinventing everything from networked storage to streaming media, always focused on doing things better, faster, and more elegantly than previously imagined.

Presentations

Delivering relevant filtered news to save hours of drudgery each day for fixed-interest securities analysts Session

Bitvore Corp’s Bitvore for Munis personalized news surveillance system is rapidly becoming a must-have for major fixed-interest securities analysts, investors, and brokers working in the three-trillion-dollar municipal bond market in the US. Alan Chaney explains how Bitvore surfaces the few important and relevant articles out of the thousands published each day, saving users hours of drudgery.

Bryan Cheng is a backend developer and analytics lead at BlockCypher. Since 2015, he has worked on infrastructure powering bitcoin and other blockchains. As analytics lead, Bryan works to combine BlockCypher’s experience with blockchains of all sizes with the latest in machine-learning and big data analytics to help governments and private industry stay informed and secure. Previously, Bryan cofounded a startup and led a network access control team at UC Berkeley, where he graduated with a BS in materials science and mechanical engineering. When not hacking in Spark or writing Golang, Bryan can be found learning Rust, riding his bike, and exploring VR.

Presentations

Spark, GraphX, and blockchains: Building a behavioral analytics platform for forensics, fraud, and finance Session

Bryan Cheng and Karen Hsu describe how they built machine-learning and graph traversal systems on Apache Spark to help government organizations and private businesses stay informed in the brave new world of blockchain technology. Bryan and Karen also share lessons learned combining these two bleeding-edge technologies and explain how these techniques can be applied to private and federated chains.

Slava Chernyak is a senior software engineer at Google. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Google Cloud Dataflow Streaming from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.

Presentations

Watermarks: Time and progress in Apache Beam (incubating) and beyond Session

Watermarks are a system for measuring progress and completeness in out-of-order streaming systems, enabling correct results to be emitted in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications.

Rumman Chowdhury comes to data science from a quantitative social science background. Prior to joining Metis, she was a data scientist at Quotient Technology, where she used retailer transaction data to build an award-winning media targeting model. Her industry experience spans public policy, economics, and consulting. Her prior clients include the World Bank, the Vera Institute of Justice, and the Los Angeles County Museum of Art. She holds two undergraduate degrees from MIT and a master’s in quantitative methods of the social sciences from Columbia and is currently finishing a political science PhD at the University of California, San Diego. Her dissertation uses machine-learning techniques to determine whether single-industry towns have a broken political process. Rumman’s passion lies in teaching and learning from teaching. In her spare time, she teaches and practices yoga, reads comic books, and works on her podcast.

Presentations

Visualizing the history of San Francisco Session

In collaboration with the Gray Area Foundation for the Arts and Metis Data Science, Rumman Chowdhury created an interactive data art installation with the purpose of educating San Franciscans about their own city. Rumman discusses the challenges of using historical, predigital-era data with D3 and R to craft a compelling and educational story residing at the intersection of art and technology.

Robert Cohen is a senior fellow at the Economic Strategy Institute, where he is directing a new study to examine the economic and business impacts of virtualization of compute, storage and networking infrastructure, big data, and the internet of things—"The New IP." Robert is formulating a series of case studies of firms that are early adopters of these technologies and would appreciate any inquiries from firms that would like to contribute to this analysis.

Presentations

The programmable enterprise: Software is central to innovation Session

Programmable enterprises are developing their businesses around cloud computing, big data, and the internet of things. Robert Cohen explores how infrastructure changes will alter corporate use of software, skilled employees, and strategies, the business and economic impacts of these changes, and the broader impacts of these shifts on our economy and society.

Christopher Colburn is just another data scientist at Netflix.

Presentations

Going real time: Creating online datasets for personalization Session

In the past, typical real-time data processing was reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more-capable hardware, the streaming context can now enable personalization applications. Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix.

Eric Colson is chief algorithms officer at Stitch Fix as well as an advisor to several big data startups. Previously, Eric was vice president of data science and engineering at Netflix. He holds a BA in economics from SFSU, an MS in information systems from GGU, and an MS in management science and engineering from Stanford.

Presentations

Organizing for data science: Some unintuitive lessons learned for unlocking value Session

Data scientists blend the skills of statisticians, software engineers, and domain experts to create new roles. Data science isn't merely an amalgam of disciplines but rather a gestalt that synthesizes the ethos of various fields. This merits new thinking when it comes to organization. Eric Colson explores some novel—and often unintuitive—ways to unleash the value of your data science team.

Dustin Cote is a customer operations engineer at Confluent. Over his career, Dustin has worked in a variety of roles from Java developer to operations engineer. His most recent focus is distributed systems in the big data ecosystem, with Apache Kafka being his software of choice.

Presentations

Mistakes were made, but not by us: Lessons from a year of supporting Apache Kafka Session

Dustin Cote and Ryan Pridgeon share their experience troubleshooting Apache Kafka in production environments and discuss how to avoid pitfalls like message loss or performance degradation in your environment.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata + Hadoop World conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Thursday Opening Welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday Opening Welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday Opening Welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday Opening Welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Michelangelo D’Agostino is the director of data science R&D at Civis Analytics, where he leads a team that develops statistical models and writes software to help companies and nonprofits leverage their data. As a reformed particle physicist turned data scientist, Michelangelo loves mungeable datasets, machine learning, and long walks on the beach (with a floppy hat, plenty of sunscreen, and a laptop). Michelangelo came to Civis from Braintree, a Chicago-based online payments company that was acquired by PayPal. Prior to Braintree, he was a senior analyst in digital analytics with the 2012 Obama re-election campaign. He helped to optimize the campaign’s email fundraising juggernaut and analyzed social media data.

Michelangelo has been a mentor with the Data Science for Social Good Fellowship. He holds a PhD in particle astrophysics from UC Berkeley and got his start in analytics sifting through neutrino data from the IceCube experiment. Accordingly, he spent two glorious months at the South Pole, where he slept in a tent salvaged from the Korean War and enjoyed the twice-weekly shower rationing. He’s also written about science and technology for the Economist.

Presentations

The power of persuasion modeling Session

How do we know that an advertisement or promotion truly drives incremental revenue? Michelangelo D'Agostino and Bill Lattner share their experience developing machine-learning techniques for predicting treatment responsiveness from randomized controlled experiments and explore the use of these “persuasion” models at scale in politics, social good, and marketing.

Shirshanka Das is the architect for LinkedIn’s Data Analytics Infrastructure team. Shirshanka was one of the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. His current focus at LinkedIn includes all things Hadoop, high-performance distributed OLAP engines, large-scale data ingestion, transformation and movement, and data lineage and discovery.

Presentations

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn Session

Yael Garten and Shirshanka Das share lessons learned at LinkedIn about best practices for using Kafka and Hadoop as the foundation of a petabyte-scale data warehouse. Without a principled approach, organizations find themselves struggling to deal with the evolution of the business. Yael and Shirshanka offer concrete suggestions to help you process data seamlessly and, beyond technology, discuss their experience running governance programs to empower data teams.

Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and is currently employed at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research about data-center frameworks and networks with Scott Shenker and Ion Stoica.

Presentations

Making Structured Streaming ready for production: Updates and future directions Session

Apache Spark 2.0 introduced the core APIs of Structured Streaming, a new stream processing engine built on Spark SQL. Since then, development efforts have focused on making the engine ready for production use. Michael Armbrust and Tathagata Das discuss the major features that have been added, recipes for using them in production, and the exciting new features planned for future releases.

Monal Daxini is an engineering manager at Netflix, where he is building a scalable and multitenant event processing pipeline and leads the infrastructure for stream processing as a service. He has worked on Netflix’s Cassandra and Dynamite infrastructure and was instrumental in developing the encoding compute infrastructure for all Netflix content. Monal has 15 years of experience building distributed systems at organizations like Netflix, NFL.com, and Cisco.

Presentations

Going real time: Creating online datasets for personalization Session

In the past, typical real-time data processing was reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more-capable hardware, the streaming context can now enable personalization applications. Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix.

Netflix Keystone SPaaS: Real-time stream processing as a service Session

Netflix Keystone processes over a trillion events per day with at-least-once processing semantics in the cloud. Monal Daxini explores what it means to offer stream processing as a service (SPaaS), how Netflix implemented a scalable, fault-tolerant multitenant SPaaS internal offering, and how it evolved the system in flight with no downtime.

Gillian Docherty is the CEO of The Data Lab, one of eight Innovation Centres across Scotland, where she is responsible for delivering the strategic vision set out by The Data Lab board, the aim of which is to create over 250 new jobs and generate more than £100 million for the economy. Gillian has over 22 years’ experience working in the IT sector. Previously, she held a range of senior leadership roles at IBM UK, including leader of the software business in Scotland, systems and technology sales leader, and territory leader for general business in Scotland. Gillian is on the board of Tech Partnership Scotland and is also a board member of the Glasgow Chamber of Commerce. Gillian holds a degree in computing science from Glasgow University. She is married and has a daughter.

Presentations

Data-driven innovation Session

Gillian Docherty shares her experience leading The Data Lab, an innovation center focused on helping organizations drive economic and social benefit through data science and analytics. Along the way, Gillian discusses some of the projects her teams have supported, from multinationals to startups, and explains how they leverage academic capability to help drive innovation from data.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions, saving millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Big data governance for the hybrid cloud: Best practices and how-to Session

Big data needs governance, not just for compliance but also for data scientists. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start—especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky shares a step-by-step approach to kick-start your big data governance initiatives.

Peng Du is a senior software engineer at Uber. He holds a PhD in computer science and an MA in applied mathematics, both from the University of California, San Diego.

Presentations

Uber's data science workbench Session

Peng Du and Randy Wei offer an overview of Uber’s data science workbench, which provides a central platform for data scientists to perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is seamlessly integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring.

Ted Dunning has been involved with a number of startups—the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member for the Apache ZooKeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data mining and passionate about open source, he is an active participant in Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

Tensor abuse in the workplace Session

Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).

Mike Dusenberry is an engineer at the IBM Spark Technology Center, where he is creating a deep learning library for SystemML and working to make deep learning performant at scale. Mike was on his way to an MD and a career as a physician in his home state of North Carolina when he teamed up with professors on a medical machine-learning research project. Two years later in San Francisco, Mike is contributing to Apache SystemML as a committer and researching medical applications for deep learning.

Presentations

Leveraging deep learning to predict breast cancer proliferation scores with Apache Spark and Apache SystemML Session

Estimating the growth rate of tumors is a very important but very expensive and time-consuming part of diagnosing and treating breast cancer. Michael Dusenberry and Frederick Reiss describe how to use deep learning with Apache Spark and Apache SystemML to automate this critical image classification task.

Barbara Eckman is a principal data architect at Comcast, where she leads data architecture and data governance for an innovative, company-wide, data-intensive infrastructure for real-time ingesting, streaming, storing, retrieving, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project Center, Merck, GlaxoSmithKline, and IBM, where she served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.

Presentations

Data integration and governance for big data with Apache Avro; or, How to solve the GIGO problem Tutorial

Big data famously enables anyone to contribute to the enterprise data store. Integrating previously siloed data can uncover powerful insights for the business, but without data governance, inefficiencies and incorrect business decisions may result. Barbara Eckman explains how Comcast is using Apache Avro for enterprise data governance, the challenges faced, and methods to address them.

Michael Edwards is a principal software engineer at Metamarkets. Michael’s idea of a full-stack developer extends from interaction design in big data analytics systems down to clock/data recovery in backscatter-modulated RF protocols. At Metamarkets, he’s all about scale-up, cost-down, with additional areas of focus in authentication/access control and easy-to-integrate data visualization components.

Presentations

Operating Kafka at petabyte scale Session

Metamarkets operates several Kafka clusters in its real-time streaming event ingestion pathway. Over the last three years, these clusters have grown beyond their original parameters, with hundreds of terabytes flowing through every day, petabytes of retention, and gigabytes of historical data streaming to and from storage. Michael Edwards shares experiences and lessons learned operating Kafka at this scale.

Laura Eisenhardt is EVP at iKnow Solutions Europe and the founder of DigitalConscience.org, a CSR platform designed to create opportunities for technical resources (specifically expats) to give back to communities with their unique skills whilst making a huge impact locally. Laura has led massive programs for the World Health Organization across Africa, collecting big data in over 165 languages, and specializes in data quality and consistency.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

Stephen Elston is an experienced big data geek, data scientist, and software business leader. Steve is principal consultant at Quantia Analytics, LLC, where he leads the building of new business lines, manages P&L, and takes software products from concept and financing through development, intellectual property protection, sales, customer shipment, and support. Steve is also an instructor for the University of Washington data science program. Steve has over two decades of experience in visualization, predictive analytics and machine learning, at scales from small to massive, using many platforms including Hadoop, Spark, R, S/SPLUS, and Python. He has created solutions in fraud detection, capital markets, wireless systems, law enforcement, and streaming analytics for the IoT.

Presentations

Exploration and visualization of large, complex datasets with R, Hadoop, and Spark Tutorial

Divide and recombine techniques provide scalable methods for exploration and visualization of otherwise intractable datasets. Stephen Elston and Ryan Hafen lead a series of hands-on exercises that help you develop skills in the exploration and visualization of large, complex datasets using R, Hadoop, and Spark.

Susan Eraly is a software engineer at Skymind, where she contributes to Deeplearning4j. Previously, Susan worked as a senior ASIC engineer at NVIDIA and as a data scientist in residence at Galvanize.

Presentations

Scalable deep learning for the enterprise with DL4J Tutorial

Dave Kale, Melanie Warwick, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets.

Justin Erickson is a senior director of product management leading Cloudera’s platform team, which is responsible for the components above the storage layer in Cloudera’s Distribution including Apache Hadoop (CDH). Prior to joining Cloudera, he led the high-availability and disaster-recovery areas of Microsoft SQL Server.

Presentations

BI and SQL analytics with Hadoop in the cloud Session

Henry Robinson and Justin Erickson explain how to best take advantage of the flexibility and cost-effectiveness of the cloud with your BI and SQL analytic workloads using Apache Hadoop and Apache Impala (incubating) to provide the same great functionality, partner ecosystem, and flexibility of on-premises deployments.

Yuliya Feldman is a principal software engineer at MapR, where she works on a number of products and features, including MapR admin infrastructure, the MapReduce framework, and, most recently, YARN. Previously, Yuliya worked in areas as diverse as satellite image processing, medical robotics, satellite phone communication, ecommerce, and big data.

Presentations

Pluggable security in Hadoop Session

Security will always be very important in the world of big data, but the choices today mostly start with Kerberos. Does that mean setting up security is always going to be painful? What if your company standardizes on other security alternatives? What if you want to have the freedom to decide what security type to support? Yuliya Feldman discusses your options.

Avrilia Floratau is a senior scientist at Microsoft’s Cloud and Information Services Lab, where her research is focused on scalable real-time stream processing systems. She is also an active contributor to Heron, collaborating with Twitter. Previously, Avrilia was a research scientist at IBM Research working on SQL-on-Hadoop systems. She holds a PhD in data management from the University of Wisconsin-Madison.

Presentations

From rivulets to rivers: Elastic stream processing in Heron Session

Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.

Valentine Fontama is a principal data scientist manager on Microsoft's Analytics + Insights data science team, which delivers analytics capabilities across Azure and C+E cloud services. Val brings over 10 years of data science experience: following a PhD in neural networks, he was a new technology consultant at Equifax in London, where he pioneered the use of data mining to improve risk assessment and marketing in the consumer credit industry. Most recently, he was a principal data scientist in Microsoft's Data & Decision Sciences Group (DDSG), where he led consulting engagements with external customers, including ThyssenKrupp and Dell.

Val also has over seven years of business experience. In prior roles at Microsoft, he was a senior product manager for big data and predictive analytics in Cloud and Enterprise Marketing, where he led product management for Azure Machine Learning; HDInsight; Parallel Data Warehouse, Microsoft's first data warehouse appliance; and three releases of Fast Track Data Warehouse.
Val holds an MBA in strategic management and marketing from Wharton Business School, a PhD in neural networks, an MSc in computing, and a BSc in mathematics and electronics. He has published 11 academic papers and coauthored three big data books: Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes (two editions) and Introducing Microsoft Azure HDInsight.

Presentations

How Microsoft predicts churn of cloud customers using deep learning and explains those predictions in an interpretable way Session

Although deep learning has proved to be very powerful, few results have been reported on its application to business-focused problems. Feng Zhu and Val Fontama explore how Microsoft built a deep learning-based churn predictive model and demonstrate how to explain its predictions using LIME—a novel algorithm published at KDD 2016—to make black-box models more transparent and accessible.

Rodrigo Fontecilla is vice president of application modernization for Unisys Federal Systems, where he leads all aspects of software development, system integration, mobile development, and data management focused on the federal government. Rod is responsible for providing leadership, coordination, and oversight of all IT solutions, emerging technologies, and IT services delivered to the federal government. He has more than 25 years of professional experience in the capture, design, development, implementation, and management of information management systems delivering mission-critical IT solutions, along with an extensive background in cloud computing, mobile development, social media, enterprise architecture, data analytics, SOA-based solutions, and IT governance.

Presentations

Machine-learning opportunities within the airline industry Session

Rodrigo Fontecilla explains how many of the largest airlines use different classes of machine-learning algorithms to create robust and reusable predictive models to provide a holistic view of operations and provide business value.

Michael J. Freedman is a professor in the Computer Science department at Princeton University as well as the cofounder and CTO of iobeam, which provides data infrastructure for the internet of things. His research broadly focuses on distributed systems, networking, and security. He developed and operates several self-managing systems, including CoralCDN (a decentralized content distribution network) and DONAR (a server resolution system that powered the FCC’s Consumer Broadband Test), both of which serve millions of users daily. Michael’s other research has included software-defined and service-centric networking, cloud storage and data management, untrusted cloud services, fault-tolerant distributed systems, virtual world systems, peer-to-peer systems, and various privacy-enhancing and anticensorship systems. Freedman’s work on IP geolocation and intelligence led him to cofound Illuminics Systems, which was acquired by Quova (now part of Neustar) in 2006. His work on programmable enterprise networking (Ethane) helped form the basis for the OpenFlow/software-defined networking (SDN) architecture. Honors include the Presidential Early Career Award for Scientists and Engineers (PECASE), a Sloan fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award publications. Michael holds a PhD in computer science from NYU’s Courant Institute and both an SB and an MEng degree from MIT.

Presentations

Designing a time series database to support IoT workloads Session

IoT applications often need more-complex queries than those supported by traditional time series databases. Michael Freedman outlines a new distributed time series database for such workloads, supporting efficient queries, including complex predicates across many metrics, while scaling out to support IoT ingest rates.

Ellen Friedman is a solutions consultant, scientist, and O'Reilly author currently writing about a variety of open source and big data topics. Ellen is a committer on the Apache Drill and Mahout projects. With a PhD in biochemistry and years of work writing on a variety of scientific and computing topics, she is an experienced communicator. Ellen is coauthor of Streaming Architecture, the Practical Machine Learning series from O'Reilly, Time Series Databases, and her newest title, Introduction to Apache Flink. She's also coauthor of a book of magic-themed cartoons, A Rabbit Under the Hat. Ellen has been an invited speaker at Strata + Hadoop World London, Berlin Buzzwords, the University of Sheffield Methods Institute, and the Philly ETE conference and a keynote speaker for NoSQL Matters 2014 in Barcelona.

Presentations

Why stream? The advantages of working with streaming data Tutorial

Life doesn’t happen in batches. Being able to work with data from continuous events as data streams is a better fit to the way life happens, but doing so presents some challenges. Ellen Friedman examines the advantages and issues involved in working with streaming data, takes a look at emerging technologies for streaming, and describes best practices for this style of work.

Ajit Gaddam is chief security architect at Visa. Ajit is a technologist, serial entrepreneur, and security expert specializing in machine learning, cryptography, big data security, and cybersecurity issues. Over the last decade, Ajit has held senior roles at various tech and financial firms and founded two startups. He is an active participant in various open source and security architecture standards bodies. A well-known security expert and industry veteran, he has authored numerous articles and white papers and is a frequent speaker at high-profile conferences such as Black Hat, Strata + Hadoop World, and SABSA World Congress. He holds multiple patents in data security and other disruptive technologies.

Presentations

End-to-end security for Kafka, Spark ML, and Hadoop Session

Apache Kafka is used today by over 35% of Fortune 500 companies to store and process some of their most sensitive datasets. Ajit Gaddam and Jiphun Satapathy provide a security reference architecture to secure your Kafka cluster while leveraging it to support your organization's cybersecurity requirements.

Sriram Ganesan is a member of the technical staff at Qubole, where he works on HBase and cluster orchestration. Previously, Sriram was at Directi, where he worked on scaling the backend of leading chat app Talk.to. Sriram holds a bachelor of computer science engineering from the National Institute of Technology, Trichy, India.

Presentations

Moving big data as a service to a multicloud world Session

Qubole started out by offering Hadoop as a service in AWS. Over time, it extended its big data capabilities beyond Hadoop and its cloud infrastructure support beyond AWS. Sriram Ganesan and Prakhar Jain explain how and why Qubole built Cloudman, a simple, cloud-agnostic, multipurpose provisioning tool that can be extended for further engines and further cloud support.

Katherine Garcia is a member of the founding team for the Center for Open Data Enterprise, where she is the director of communications and outreach. Katherine is the project lead for the Open Data Roundtables, gatherings that identify case studies, lessons learned, and best practices in open data across the federal government. She also manages a foundation-funded project to develop a transition plan for open data in the next administration. In 2015, Katherine launched the Washington Viz-ards, a meetup group for data visualization enthusiasts in DC. Previously Katherine worked at the Governance Lab at New York University, the US Mission to the United Nations, and Bonnier Corporation. She has over 10 years of media and publishing experience. Katherine is a former copresident of the New York chapter of the US National Committee for UN Women, which aims to increase public, political, and financial support for UN Women—the United Nations entity for women’s empowerment and gender equality.

Presentations

The future of open data: Building businesses with a major national resource Session

Open government data—free public data that anyone can use and republish—is a major resource for entrepreneurs and innovators. The Center for Open Data Enterprise has partnered with the White House, government, and businesses to show how this resource can create economic value. Joel Gurin and Katherine Garcia share case studies of how open data is being used and a vision for its future.

Yael Garten leads a team of data scientists at LinkedIn that focuses on understanding and increasing growth and engagement of LinkedIn’s 400 million members across mobile and desktop consumer products. Yael is an expert at converting data into actionable product and business insights that impact strategy. Her team partners with product, engineering, design, and marketing to optimize the LinkedIn user experience, creating powerful data-driven products to help LinkedIn’s members be productive and successful. Yael champions data quality at LinkedIn; she has devised organizational best practices for data quality and developed internal data tools to democratize data within the company. Yael also advises companies on informatics methodologies to transform high throughput data into insights and is a frequent conference speaker. She has a PhD in biomedical informatics from the Stanford University School of Medicine, where her research focused on information extraction via natural language processing to understand how human genetic variations impact drug response. She holds an MSc from the Weizmann Institute of Science in Israel.

Presentations

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn Session

Yael Garten and Shirshanka Das describe best practices learned at LinkedIn for using Kafka and Hadoop as the foundation of a petabyte-scale data warehouse. Without a principled approach, organizations struggle to cope with the evolution of the business. Garten and Das offer concrete suggestions to help you process data seamlessly and, beyond technology, share their experience running governance processes that empower data teams.

How we work: The unspoken challenges of doing data science Session

Data science is a rewarding career. It's also really hard—not just the technical work itself but also "how to do the work well" in an organization. Yael Garten explores what data scientists do, how they fit into the broader company organization, and how they can excel at their trade and shares the hard and soft skills required, tips and tricks for success, and challenges to watch out for.

Tim Gasper is cofounder of Ponos, an IoT-connected, big data hydroponics farming solution that helps you grow fresh vegetables, fruit, and herbs in your own home with a fully automated, indoor smart farm. He is also director of product marketing at Bitfusion, a GPU virtualization company for easier, more scalable deep learning. Tim has over eight years of big data, IoT, and enterprise content product management and product marketing experience. He is also a writer and speaker on entrepreneurship, the lean startup methodology, and big data analytics. Previously, Tim was global portfolio manager for CSC Big Data and Analytics, where he was responsible for the overall strategy, roadmap, partnerships, and technology mix for the big data and analytics product portfolio; vice president of product at Infochimps (acquired by CSC), where he led product development for its market-leading open data marketplace and big data platform as a service; and cofounder of Keepstream, a social media analytics and curation company.

Presentations

Robot farmers and chefs: In the field and in your kitchen Session

Food production and preparation have always been labor and capital intensive, but with the internet of things, low-cost sensors, cloud-computing ubiquity, and big data analysis, farmers and chefs are being replaced with connected, big data robots—not just in the field but also in your kitchen. Tim Gasper explores the tech stack, data science techniques, and use cases driving this revolution.

Bas Geerdink is a programmer, scientist, and IT manager at ING, where he is responsible for the fast data systems that process and analyze streaming data. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Building a streaming analytics solution to provide real-time actionable insights to customers Tutorial

ING is a data-driven enterprise that is heavily investing in big data, analytics, and streaming processing. Bas Geerdink offers an overview of ING's streaming analytics solution for providing actionable insights to customers, built with a combination of open source technologies, including Kafka, Flink, and Cassandra.

Charles Givre is an unapologetic data geek who is passionate about helping others learn about data science and become passionate about it themselves. For the last five years, Charles has worked as a data scientist at Booz Allen Hamilton for various government clients and has done some really neat data science work along the way, hopefully saving US taxpayers some money. Most of his work has been in developing meaningful metrics to assess how well the workforce is performing. For the last two years, Charles has been part of the management team for one of Booz Allen Hamilton's largest analytic contracts, where he was tasked with increasing the amount of data science on the contract—both in terms of tasks and people.

Even more than the data science work, Charles loves learning about and teaching new technologies and techniques. He has been instrumental in bringing Python scripting to both his government clients and the analytic workforce and has developed a 40-hour Introduction to Analytic Scripting class for that purpose. Additionally, Charles has developed a 60-hour Fundamentals of Data Science class, which he has taught to Booz Allen staff, government civilians, and US military personnel around the world. Charles has a master’s degree from Brandeis University, two bachelor’s degrees from the University of Arizona, and various IT security certifications. In his nonexistent spare time, he plays trombone, spends time with his family, and works on restoring British sports cars.

Presentations

What else does your smart car know about you? Session

Data generated from vehicles can be an incredible yet largely untapped source of insights. Building on a talk from Strata + Hadoop World London last year, Charles Givre and his team present the results of their research in applying machine learning to vehicle telematics data.

David Goodman is NetHope's CIO-in-residence. David focuses on bringing both technical and thought leadership to bear on the implementation of NetHope's strategic plan. Toward that end, he works on developing strategies to bolster NetHope's relationship with the technology sector and works with the NetHope leadership team to ensure NetHope's activities are properly oriented toward its key constituents: CIOs and technology leaders in the global development sector.

Before joining NetHope, David served as the CIO of the International Rescue Committee, where he had global responsibility for all technology-related activities and oversaw teams focused on infrastructure, application development, user services, and project management.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

Josh Gordon works as a developer advocate on TensorFlow at Google. He’s passionate about machine learning and computer science education. In his free time, Josh loves biking, running, and exploring the great outdoors.

Presentations

Getting started with TensorFlow Tutorial

Josh Gordon walks you through training and deploying a machine-learning system using TensorFlow, a popular open source library. You'll learn how to build machine-learning systems from simple classifiers to complex image-based models and how to deploy models in production using TensorFlow Serving.
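
For readers who want a feel for the library before the tutorial, here is a minimal sketch of a TensorFlow classifier using the Keras API bundled with TensorFlow; the toy dataset, layer sizes, and save path are illustrative, not taken from the session.

```python
# Minimal TensorFlow classifier sketch (illustrative only, not the
# tutorial's code). Trains on random stand-in data.
import numpy as np
import tensorflow as tf

# Toy data: 1,000 samples, 20 features, 2 classes.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32)

# For production, the trained model can be exported in the SavedModel
# format, which TensorFlow Serving loads.
model.save("simple_classifier")
```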

Felix Gorodishter is a software architect at GoDaddy. Felix is a web developer, technologist, entrepreneur, husband, and daddy.

Presentations

Big data for operational insights Session

GoDaddy ingests and analyzes logs, metrics, and events at rates of 100,000 events per second. Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of more than 10 TB of daily growth to gain operational insights into its cloud, leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.

Bill Graham is a staff engineer on the Real Time Compute team at Twitter. Bill’s primary areas of focus are data processing applications and analytics infrastructure. Previously, he was a principal engineer at CBS Interactive and CNET Networks, where he worked on ad targeting and content publishing infrastructure, and a senior engineer at Logitech focusing on webcam streaming and messaging applications. Bill contributes to a number of open source projects, including HBase, Hive, and Presto, and he’s a Heron and Pig committer.

Presentations

From rivulets to rivers: Elastic stream processing in Heron Session

Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.

Arthur Grava is the big data team leader at Luizalabs, where he works closely with the company’s recommender system and focuses on machine learning with Hadoop, Java, Cassandra, and Python. Arthur holds a master’s degree in recommender systems from USP.

Presentations

Building a recommender from a big behavior graph over Cassandra Session

Gleicon Moraes and Arthur Grava share war stories about developing and deploying a cloud-based large-scale recommender system for a top-three Brazilian ecommerce company. The system, which used Cassandra and graph traversal, led to a more than 15% increase in sales.

Jamie Grier is director of applications engineering at data Artisans, where he's extremely excited to be able to help others realize the potential of Flink in their own projects. Jamie's goal is to help others design systems to solve challenging problems in the real world. Jamie has been working in the field of streaming computation for the last decade, spanning everything from ultra-high-performance video stream acquisition and processing to social media analytics. Prior to joining data Artisans, Jamie was at Twitter working on rethinking the real-time analytics stack with the goals of making it much more efficient and also capable of computing accurate results in real time without relying on the lambda architecture for correctness. Before Twitter, Jamie was one of the lead engineers at Gnip building their social media streaming, filtering, and delivery system and the lead engineer at Boulder Imaging working on systems that could ingest and process greater than 1 GB per second of streaming video on a single machine. Jamie is interested in streaming computation and mechanically sympathetic software architectures. He is particularly interested in building systems that are both high performance and highly scalable. His favorite quote is "You can have a second computer once you've shown you know how to use the first one."

Presentations

Apache Flink: The latest and greatest Session

Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics include queryable state, dynamic scaling, streaming SQL, very large state support, and whatever is the latest and greatest in March 2017.

Robert Grossman is a faculty member and the chief research informatics officer in the Biological Sciences Division of the University of Chicago. Robert is the director of the Center for Data Intensive Science (CDIS) and a senior fellow at both the Computation Institute (CI) and the Institute for Genomics and Systems Biology (IGSB). He is also the founder and a partner of the Open Data Group, which specializes in building predictive models over big data. Robert has led the development of open source software tools for analyzing big data (Augustus), distributed computing (Sector), and high-performance networking (UDT). In 1996, he founded Magnify, Inc., which provides data-mining solutions to the insurance industry and was sold to ChoicePoint in 2005. He is also the chair of the Open Cloud Consortium, a not-for-profit that supports the research community by operating cloud infrastructure, such as the Open Science Data Cloud. He blogs occasionally about big data, data science, and data engineering at Rgrossman.com.

Presentations

The dangers of statistical significance when studying weak effects in big data: From natural experiments to p-hacking Session

When there is a strong signal in a large dataset, many machine-learning algorithms will find it. On the other hand, when the effect is weak and the data is large, there are many ways to discover an effect that is in fact nothing more than noise. Robert Grossman shares best practices so that you will not be accused of p-hacking.
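
A small simulation makes the danger concrete (an illustration of the phenomenon, not material from the session): run enough hypothesis tests on pure noise and roughly 5% will clear p < 0.05 by chance alone.

```python
# With enough hypothesis tests on pure noise, ~5% appear "significant"
# at p < 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_positives = 0
for _ in range(1000):                     # 1,000 candidate "effects"
    a = rng.normal(size=500)              # group A: pure noise
    b = rng.normal(size=500)              # group B: the same distribution
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
print(false_positives)                    # typically around 50
```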

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop and a committer and PMC member on Apache Sentry and has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on big data topics at national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Carlos Guestrin is the director of machine learning at Apple and the Amazon Professor of Machine Learning in Computer Science and Engineering at the University of Washington. Carlos was the cofounder and CEO of Turi (formerly Dato and GraphLab), a machine-learning company acquired by Apple. A world-recognized leader in the field of machine learning, Carlos was named one of the 2008 Brilliant 10 by Popular Science. He received the 2009 IJCAI Computers and Thought Award for his contributions to artificial intelligence and a Presidential Early Career Award for Scientists and Engineers (PECASE).

Presentations

Paying the technical debt of machine learning: Managing ML models in production Session

The growth in machine-learning deployments has led to a commensurately huge (and well-documented) accumulation of technical debt. Carlos Guestrin outlines some of the key challenges in large-scale deployments of many interacting machine-learning models and shares a methodology for the management, monitoring, and optimization of such models in production, which helps mitigate the technical debt.

Debraj GuhaThakurta is a senior data scientist in Microsoft’s Azure Machine Learning group, where he focuses on the use of different platforms and toolkits, such as Microsoft’s Cortana Analytics suite, R Server, SQL Server, Hadoop, and Spark clusters, for creating scalable and operationalized analytical processes for various business problems. Debraj has extensive industry experience in the biopharma and financial forecasting domains. He holds a PhD in chemistry and biophysics and did postdoctoral research in machine-learning applications in genomics. Debraj has published more than 25 peer-reviewed papers, book chapters, and patents.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Sijie Guo is the tech lead of Twitter’s Messaging group. Sijie is the cocreator of Apache DistributedLog and the PMC chair of Apache BookKeeper.

Presentations

Building reliable real-time services with Apache DistributedLog Session

Apache DistributedLog (incubating) is a low-latency, high-throughput replicated log service. Sijie Guo shares how Twitter has used DistributedLog as the real-time data foundation in production for years, supporting services like distributed databases, pub-sub messaging, and real-time stream computing and delivering more than 1.5 trillion events (17 PB) per day.

Shekhar Gupta is a software engineer at Pepperdata. He holds a PhD from TU Delft, where he focused on using machine learning to improve and monitor the performance of distributed systems.

Presentations

Big data for big data: Machine-learning models of Hadoop cluster behavior Session

Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.

Joel Gurin is president and founder of the Center for Open Data Enterprise, a Washington-based nonprofit that works to maximize the value of open data as a public resource. Joel also serves as a senior open data consultant to the World Bank. Previously, he wrote the book Open Data Now and led the launch team for the GovLab’s Open Data 500 study and Open Data Roundtables. Joel served as chair of the White House Task Force on Smart Disclosure and as chief of the Consumer and Governmental Affairs Bureau of the US Federal Communications Commission. For over a decade, Joel was editorial director and executive vice president of Consumer Reports, where he directed the development of ConsumerReports.org, the world’s largest paid-subscription, information-based website. He can be reached at joel@odenterprise.org.

Presentations

The future of open data: Building businesses with a major national resource Session

Open government data—free public data that anyone can use and republish—is a major resource for entrepreneurs and innovators. The Center for Open Data Enterprise has partnered with the White House, government, and businesses to show how this resource can create economic value. Joel Gurin and Katherine Garcia share case studies of how open data is being used and a vision for its future.

Ryan Hafen is an independent statistical consultant and an adjunct assistant professor in the Statistics Department at Purdue University. Ryan’s research focuses on methodology, tools, and applications in exploratory analysis, statistical model building, and machine learning on large, complex datasets. He is the developer of the datadr and Trelliscope components of the Tessera project (now DeltaRho) as well as the rbokeh visualization package. Ryan’s applied work on analyzing large, complex data has spanned many domains, including power systems engineering, nuclear forensics, high-energy physics, biology, and cybersecurity. Ryan holds a BS in statistics from Utah State University, an MStat in mathematics from the University of Utah, and a PhD in statistics from Purdue University.

Presentations

Exploration and visualization of large, complex datasets with R, Hadoop, and Spark Tutorial

Divide and recombine techniques provide scalable methods for exploration and visualization of otherwise intractable datasets. Stephen Elston and Ryan Hafen lead a series of hands-on exercises to help you develop skills in exploration and visualization of large, complex datasets using R, Hadoop, and Spark.

Matar Haller is a data scientist at Winton Capital. Previously, Matar was a neuroscientist at UC Berkeley, where she recorded and analyzed signals from electrodes surgically implanted in human brains.

Presentations

Automatic speaker segmentation: Using machine learning to identify who is speaking when Session

With the exploding growth of video and audio content online, there's an increasing need for indexable and searchable audio. Matar Haller demonstrates how to automatically identify who is speaking when in a recorded conversation using machine learning applied to a corpus of audio recordings. Matar shares how she approached the problem, the algorithms used, and steps taken to validate the results.
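
As a rough, generic illustration of the approach (a common diarization baseline of MFCC features plus clustering; the session's actual pipeline may differ), a sketch could look like the following. The filename and the two-speaker assumption are hypothetical.

```python
# Generic speaker-segmentation sketch: MFCC features per frame, then
# cluster frames into "speakers." Not the presenter's pipeline.
import numpy as np
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load("conversation.wav", sr=16000)  # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
frames = mfcc.T                                     # one row per frame

# Assume two speakers; assign each frame to a cluster.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(frames)

# Map frame indices back to timestamps to answer "who spoke when."
times = librosa.frames_to_time(np.arange(len(labels)), sr=sr)
for t, speaker in zip(times[:10], labels[:10]):
    print(f"{t:.2f}s -> speaker {speaker}")
```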

Bryan Harrison is vice president of credit and operational risk business intelligence at American Express. With more than 15 years of business and IT experience in risk, analytics, and business intelligence, leading complex global projects spanning both business and IT organizations, Bryan has helped organizations manage strategic change with a combination of process improvement and innovation. Having held roles in both IT and analytics, Bryan understands both the business and the technical sides of BI and big data, enabling him to bridge the gap between people, processes, and technology.

Presentations

How American Express scaled BI on Hadoop for interactive, billion-row, 2000+-user queries Tutorial

American Express processes 24% of global credit card transactions, making data, risk, and security top priorities. Bryan Harrison highlights the modern process, people, and architecture approach that has enabled Amex to scale BI on Hadoop, providing instant access to real-time, granular data as well as broad historical views for modeling, so Amex can stay ahead of fraud in the future.

Joseph M. Hellerstein is the Jim Gray Chair of Computer Science at UC Berkeley and cofounder and CSO at Trifacta. Joe’s work focuses on data-centric systems and the way they drive computing. He is an ACM fellow, an Alfred P. Sloan fellow, and the recipient of three ACM-SIGMOD Test of Time awards for his research. He has been listed by Fortune Magazine among the 50 smartest people in technology, and MIT Technology Review included his work on their TR10 list of the 10 technologies most likely to change our world.

Presentations

Beyond polarization: Data UX for a diversity of workers Session

Joe Hellerstein, Giorgio Caviglia, and Alon Bartur share their design philosophy for users and their experience designing UIs, illustrating their design principles with core elements from Trifacta, including the founding technology of predictive interaction, recent innovations like transform builder, and other developments in their core transformation experience.

Seth Hendrickson is a data scientist and Scala developer in IBM’s Spark Technology Center. Seth is focused on developing highly parallel machine-learning algorithms for the Apache Spark cluster computing ecosystem.

Presentations

Spark Structured Streaming for machine learning Session

Structured Streaming is new in Apache Spark 2.0 and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.
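
For background, the canonical Structured Streaming word count from the Spark documentation shows the API the streaming machine-learning work builds on; this is context for the session, not the presenters' code.

```python
# Canonical Structured Streaming word count (from the Spark docs).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()

# Stream lines from a socket; feed it with `nc -lk 9999`.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print running counts as new data arrives.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```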

Ricky Hennessy is a data scientist at frog’s Austin studio, where he works on multidisciplinary teams to design and prototype data-driven products and help clients craft intelligent data strategies. Ricky has worked with clients in the defense, financial, insurance, professional sports, and retail industries. In addition to his work with frog, Ricky also works as a data science instructor at General Assembly. Ricky holds a PhD in biomedical engineering from UT Austin, where he gained expertise in scientific research, machine learning, algorithm development, and data analysis.

Presentations

Bringing data into design: How to craft personalized user experiences Session

From personalized newsfeeds to curated playlists, users want tailored experiences when they interact with their devices. Ricky Hennessy and Charlie Burgoyne explain how frog’s interdisciplinary teams of designers, technologists, and data scientists create data-driven, personalized, and adaptive user experiences.

Craig Hibbeler is principal for big data and security within MasterCard Advisors’ Enterprise Information Management consultancy practice. In his role, Craig leverages practical hands-on experience and broad industry and platform knowledge to develop, execute, secure, and drive results with customers’ big data platforms and initiatives.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

Bob Horton is a senior data scientist in the Microsoft Partner Ecosystem. Bob came to Microsoft from Revolution Analytics, where he was on the Professional Services team. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento. Bob currently holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Juliet Hougland is a data scientist at Cloudera and contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil and gas pipelines at Deep Signal and designing and building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in applied mathematics from the University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in math-physics.

Presentations

Guerrilla guide to Python and Apache Hadoop Tutorial

Using an interactive demo format with accompanying online materials and data, data scientist Juliet Hougland offers a practical overview of the basics of using Python data tools with a Hadoop cluster.

Nischal HP is a cofounder of and data scientist at Unnati Data Labs, where he is building end-to-end data science systems in the fields of fintech, marketing analytics, and event management. Nischal is also a mentor for data science on Springboard. Previously, during his tenure at Redmart, he built from scratch various ecommerce systems for catalog management, recommendation engines, and sentiment analysis, and at SAP Labs, he built various data crawlers and intention-mining systems and laid the initial groundwork for an end-to-end text mining and analysis pipeline; the majority of his work, however, centered on gamifying technical indicators for algorithmic trading platforms. Nischal has conducted workshops in the field of deep learning across the world and has spoken at a number of data science conferences. He is a strong believer in open source and loves to architect big, fast, and reliable systems. In his free time, he enjoys music, traveling, and meeting new people.

Presentations

Making architecture choices for small and big data problems Session

Not all data science problems are big data problems. Lots of small and medium-sized product companies want to start their journey to becoming data-driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed, fault-tolerant tools.

Karen Hsu is head of growth at BlockCypher. Karen has over 20 years of experience in technology, with a focus on business intelligence, fintech, and the blockchain, and has worked in a variety of engineering, marketing, and sales roles to bring new products to market. She has coauthored four patents. Karen holds a BS in management science and engineering from Stanford University.

Presentations

Spark, GraphX, and blockchains: Building a behavioral analytics platform for forensics, fraud, and finance Session

Bryan Cheng and Karen Hsu describe how they built machine-learning and graph traversal systems on Apache Spark to help government organizations and private businesses stay informed in the brave new world of blockchain technology. Bryan and Karen also share lessons learned combining these two bleeding-edge technologies and explain how these techniques can be applied to private and federated chains.

Grace Huang is a data scientist at Pinterest.

Presentations

Building a sustainable content ecosystem at Pinterest Session

With over 75 billion pins, the Pinterest content corpus is one of the largest human-curated collections of ideas. Grace Huang walks you through the life cycle of a piece of content on Pinterest, a portfolio of metrics developed to monitor the health of the content corpus, and the story of creating a cross-functional initiative to preserve a healthy, sustainable content ecosystem.

Tim Hunter is a software engineer at Databricks and contributes to the Apache Spark MLlib project. Tim holds a PhD from UC Berkeley, where he built distributed machine-learning systems starting with Spark version 0.2.

Presentations

Best practices for deep learning on Apache Spark Session

Joseph Bradley and Tim Hunter share best practices for building deep learning pipelines with Apache Spark, covering cluster setup, data ingest, tuning clusters, and monitoring jobs—all demonstrated using Google’s TensorFlow library.
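
One recurring pattern in such pipelines is broadcasting trained model weights to executors for parallel scoring. The sketch below illustrates the idea with a stand-in dot-product "model" instead of a real network; it is an assumption-laden sketch, not the presenters' code.

```python
# Broadcast-and-score pattern: ship (small) model weights to every
# executor once, then score partitions in parallel. The dot product
# stands in for a real network's forward pass.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-scoring-sketch").getOrCreate()
sc = spark.sparkContext

weights = np.array([0.5, -1.2, 2.0])   # stand-in for trained weights
bc = sc.broadcast(weights)             # one copy per executor, not per task

def score_partition(rows):
    w = bc.value                       # fetch broadcast weights once
    for features in rows:
        yield float(np.dot(w, features))

data = sc.parallelize([np.array([1.0, 0.0, 1.0]),
                       np.array([0.0, 1.0, 0.5])])
print(data.mapPartitions(score_partition).collect())
```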

Alysa Z. Hutnik is a partner in the Advertising & Marketing and Privacy & Information Security practices at Kelley Drye & Warren LLP in Washington, DC. Her practice represents clients in all forms of consumer-protection matters, from counseling to defending regulatory investigations and litigation. Alysa’s specific focus is on privacy, data security, and advertising law, including unfair and deceptive practices, electronic and mobile commerce, and data sharing. Alysa is past chair of the ABA’s Privacy and Information Security Committee (Section of Antitrust), the cochair of the section’s 2011 Consumer Protection Conference, and the editor-in-chief of the ABA’s Data Security Handbook, a practical guide for data-security legal practitioners. To find out more about Alysa and Kelley Drye & Warren LLP, visit KelleyDrye.com, subscribe to the AdLawAccess.com blog, or find Kelley Drye on Facebook.

Presentations

Doing data right: Legal best practices for making your data work Session

Big data promises enormous benefits for companies, and new innovations in this space only mean more data collection is required. Having a solid understanding of legal obligations will help you avoid the legal snafus that can come with collecting big data. Alysa Hutnik and Crystal Skelton outline legal best practices and practical tips to avoid becoming a big data “don’t.”

Mario Inchiosa’s passion for data science and high-performance computing drives his work at Microsoft, where he focuses on delivering parallelized, scalable advanced analytics integrated with the R language. Previously, Mario served as Revolution Analytics’s chief scientist and as analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R. Prior to that, Mario was US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances. He also served as US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining, and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning Publication of the Year and Open Literature Publication Excellence awards.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Anand Iyer is a senior product manager at Cloudera, the leading vendor of open source Apache Hadoop. His primary areas of focus are platforms for real-time streaming, Apache Spark, and tools for data ingestion into the Hadoop platform. Before joining Cloudera, Anand worked as an engineer at LinkedIn, where he applied machine-learning techniques to improve the relevance and personalization of LinkedIn’s Feed. Anand has extensive experience leveraging big data platforms to deliver products that delight customers. He holds a master’s in computer science from Stanford and a bachelor’s from the University of Arizona.

Presentations

Practical considerations for running Spark workloads in the cloud Session

Both Spark workloads and use of the public cloud have been rapidly gaining adoption in mainstream enterprises. Anand Iyer and Philip Langdale discuss new developments in Spark and provide an in-depth discussion on the intersection between the latest Spark and cloud technologies.

Matthew Jacobs is a software engineer at Cloudera working on Impala.

Presentations

Deploying and operating big data analytic apps on the public cloud Tutorial

Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Romit Jadhwani is business intelligence lead at Pinterest, where he is helping to improve business operations by enabling data analytics, data science, and visualization. Romit has over 10 years of experience in business intelligence and analytics across technology, online advertising, telecommunications, and financial services industries. He loves solving challenges at scale and is passionate about unlocking the maximum business potential of technology assets. Previously, Romit led a BI team at Google focused on financial analytics for Google’s advertising products. He holds a graduate degree in computer science.

Presentations

How Pinterest scaled to build the world’s catalog of 75+ billion ideas Session

Over the course of just six years, Pinterest has helped over 100 million pinners discover and collect more than 75 billion ideas to plan their everyday lives. Romit Jadhwani walks you through the different phases of this hypergrowth journey and explores the focuses, thought processes, and decisions of Pinterest's data team as they scaled and enabled this growth.

Prakhar Jain is a member of the technical staff at Qubole, where he works on the cluster orchestration stack. Prakhar holds a bachelor of computer science engineering from the Indian Institute of Technology, Bombay, India.

Presentations

Moving big data as a service to a multicloud world Session

Qubole started out by offering Hadoop as a service in AWS. Over time, it extended its big data capabilities beyond Hadoop and its cloud infrastructure support beyond AWS. Sriram Ganesan and Prakhar Jain explain how and why Qubole built Cloudman, a simple, cloud-agnostic, multipurpose provisioning tool that can be extended for further engines and further cloud support.

Chandan Joarder is the principal engineer at Macy's, where his team is responsible for data science integration for Macys.com. Chandan's management of the systems behind Macy's online presence has led to significant strides in incorporating real-time analytics into the company's endeavors.

Presentations

Redefining online retail in real time with Kafka and Spark Session

Chandan Joarder provides the definitive guide to building real-time dashboards using tools such as Kafka, Spark, and an in-memory database. By breaking down the architectural principles Macy’s uses in its dashboards to provide an up-to-the-hour, 360-degree view of customer data, Chandan demonstrates how real time is directly impacting the retail industry.
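
As a generic sketch of the ingestion side of such an architecture (not Macy's actual code; the topic name and broker address are hypothetical), Spark Streaming can read directly from Kafka and aggregate each micro-batch for a dashboard to consume.

```python
# Kafka -> Spark Streaming ingestion sketch (requires the
# spark-streaming-kafka package on the classpath).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="dashboard-ingest-sketch")
ssc = StreamingContext(sc, 5)          # 5-second micro-batches

# Direct stream from Kafka; topic and broker are illustrative.
stream = KafkaUtils.createDirectStream(
    ssc, ["page_views"], {"metadata.broker.list": "localhost:9092"})

# Count events per batch; a real dashboard would write these
# aggregates to an in-memory store instead of the console.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```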

Presentations

Building the “future you” retirement planning service on a Hadoop data lake Tutorial

Chris Murphy and Emma Jones explain how leading global insurer Zurich adopted Hadoop to leverage its data to underpin the new customer-centric ethos of the organization and how this has enabled a new approach to helping customers truly understand their financial portfolios, build a roadmap to meet their financial goals, identify opportunities, and secure their financial future.

Dirk Jungnickel is a senior vice president heading the central Business Analytics and Big Data function of Emirates Integrated Telecommunications Company (du), which integrates data warehousing, big data platforms, BI tools, data governance, business intelligence, and advanced analytics capabilities. Following an academic career in theoretical physics, with more than seven years of postdoctoral research, he has spent 17 years in telecommunications. A seasoned telecommunications executive, Dirk has held a variety of roles in international firms, including senior IT and IT architecture roles, various program management and business intelligence positions, the head of corporate PMO, and an associate partner with a global management and strategy consulting firm.

Presentations

Data monetization: A telecommunications use case Tutorial

Ben Sharma and Dirk Jungnickel discuss how Dubai-based telco leader du leverages big data to create smart cities and enable location-based data monetization, covering business objectives and outcomes and addressing technical and analytical challenges.

Amanda Kahlow is the CEO and founder of 6sense, the leading B2B predictive intelligence solution, which has raised $36 million from top investors like Bain Capital, Venrock, and Salesforce and secured million-dollar deals with global brands such as Cisco, IBM, Dropbox, and Dell. Amanda is a longtime player and leader in B2B marketing and was included on SF Business Times's “40 Under 40” and Fierce CMO “Women to Watch” lists in 2015. Previously, Amanda spent 14 years as the CEO and founder of CI Insights, a big data services company that used multichannel analytics to help enterprise companies generate hundreds of millions in net-new business.

Presentations

Inside predictive intelligence, the powerful technology disrupting sales and marketing Session

What if companies could predict what products people will buy, how much they will buy, and when? With predictive intelligence, they can. Amanda Kahlow and Mike Mansbach dive into the technology behind how BlueJeans Network was able to leverage predictive analytics to uncover buyers earlier, convert them at a 20x higher rate, and build a $33M pipeline.

David Kale is a deep learning engineer at Skymind and a PhD candidate in computer science at the University of Southern California (advised by Greg Ver Steeg of the USC Information Sciences Institute). David’s research uses machine learning to extract insights from digital data in high-impact domains, such as healthcare. Recently, he has pioneered the application of recurrent neural nets to modern electronic health records data. At Skymind, he is developing the ScalNet Scala API for DL4J and working on model interoperability between DL4J and other major frameworks. David organizes the Machine Learning and Healthcare Conference (MLHC), is a cofounder of Podimetrics, and serves as a judge in the Qualcomm Tricorder XPRIZE competition. David is supported by the Alfred E. Mann Innovation in Engineering Fellowship.

Presentations

Scalable deep learning for the enterprise with DL4J Tutorial

Dave Kale, Melanie Warwick, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Intelligent pattern profiling on semistructured data with machine learning Session

It's well known that data analysts spend 80% of their time preparing data and only 20% analyzing. In order to change that ratio, organizations must build tools specifically designed for working with ad hoc (semistructured) data. Sean Kandel and Michael Minar discuss the development of a new technique leveraging machine learning to discover and profile the inherent structure in ad hoc datasets.

Why the next wave of data lineage is driven by automation, visualization, and interaction Session

Sean Kandel and Wei Zheng offer an overview of an entirely new approach to visualizing metadata and data lineage, demonstrating automated methods for detecting, visualizing, and interacting with potential anomalies in reporting pipelines. Join in to learn what’s required to efficiently apply these techniques to large-scale data.

Chris Kang is an R&D associate principal within the Systems and Platforms research group at Accenture Labs. His current work is focused on model management for analytical models in big data streaming architectures. Chris’ roles at Accenture have varied from software developer to data engineer. Previously, Chris was a cloud architect with an emphasis on private cloud technologies.

Presentations

DevOps for models: How to manage millions of models in production Session

As Accenture scaled from thousands to millions of predictive models, it needed automation to manage models at scale, ensure accuracy, prevent false alarms, and preserve trust as models are created, tested, and deployed into production. Teresa Tung, Jürgen Weichenberger, and Chris Kang discuss embracing DevOps for models by employing a self-healing approach to model life-cycle management.

Holden Karau is a software development engineer at IBM and is active in open source. Prior to IBM, she worked on a variety of big data, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. Holden is the author of Learning Spark and has assisted with Spark workshops. She graduated from the University of Waterloo with a bachelor of mathematics in computer science.

Presentations

Debugging Apache Spark Session

Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau and Rachel Warren explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
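
A tiny example shows why laziness complicates debugging (an illustration, not code from the session): the failure surfaces at the action, far from the transformation that actually caused it.

```python
# Lazy evaluation defers failures: the bad record below raises nothing
# when the transformation is defined, only when an action runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-debug-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["1", "2", "not-a-number"])
parsed = rdd.map(int)   # no error here: map is lazy

parsed.collect()        # the worker's ValueError surfaces here, at the
                        # action, far from the map() that caused it
```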

Spark Structured Streaming for machine learning Session

Structured Streaming is new in Apache Spark 2.0 and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.

Andra Keay is the managing director of Silicon Valley Robotics, an industry group supporting innovation and commercialization of robotics technologies. Andra is also founder of Robot Launch, global robotics startup competition, cofounder of Robot Garden hackerspace, mentor at hardware accelerators, a startup advisor, and an active angel investor in robotics startups. Andra is also a director at Robohub.org, the global site for news and views on robotics. Previously Andra was an ABC film, television, and radio technician and taught interaction design at the University of Technology, Sydney. Andra has keynoted at major international conferences, including USI 2016, WebSummit 2014 and 2015, Collision 2015 and 2016, Pioneers Festival 2014, JavaOne 2014, Solid 2014, and SxSW 2015. She was selected as an HRI Pioneer in 2010. Andra holds a BA in communication from the University of Technology, Sydney, Australia, and an MA in human-robot culture from the University of Sydney, Australia, where her work built on her background as a robot geek, STEM educator, and filmmaker.

Presentations

Making good robots Keynote

Let’s stop talking about bad robots and start talking about what makes a robot good. A good or ethical robot must be carefully designed. Andra Keay outlines five principles of good robot design and discusses the implications of implicit bias in our robots.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install- and click-fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open sourced techniques for anomaly detection and breakout detection. His prior research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Anomaly detection in real-time data streams using Heron Session

Anomaly detection plays a key role in the analysis of real-time streams; detecting real-life incidents from tweet storms is one example. Arun Kejariwal and Karthik Ramasamy walk you through how anomaly detection is supported in real-time data streams in Heron, the streaming system built in-house at Twitter (and open sourced) for real-time computation.
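
Heron's detection machinery is the subject of the session itself, but the core idea can be sketched generically: score each arriving point against a window of recent history and flag large deviations. A toy rolling z-score detector (illustrative only, not Heron's implementation):

    from collections import deque
    from math import sqrt

    def zscore_detector(stream, window=50, threshold=3.0):
        # Flag points more than `threshold` standard deviations from
        # the mean of the trailing `window` observations.
        history = deque(maxlen=window)
        for x in stream:
            if len(history) >= 10:  # wait for a minimal baseline
                mean = sum(history) / len(history)
                var = sum((v - mean) ** 2 for v in history) / len(history)
                std = sqrt(var) or 1e-9
                if abs(x - mean) / std > threshold:
                    yield x  # anomalous point
            history.append(x)

    # A quiet series with one spike:
    data = [10, 11, 9, 10, 10, 11, 10, 9, 10, 11, 10, 10, 95, 10, 11]
    print(list(zscore_detector(data, window=10)))  # -> [95]

Production systems add seasonality handling, robust statistics, and distribution over a topology, which is where the session picks up.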

Phil Keslin is the chief technology officer and founder of Niantic, Inc., creator of Pokémon GO. At Niantic, started in early 2011 and incubated within Google, Phil led the engineering team in exploring the convergence of mobile, geo, and social across a range of applications leading up to the launch of Ingress and Field Trip.

Prior to Niantic, Phil contributed to the Street View, Gmail, and Lively products at Google. Earlier, he was a GPU architect at NVIDIA and a key contributor to the design and development of several of its GPUs.

In 2000, Phil joined John Hanke to found Keyhole. As its CTO, Phil led the development of the EarthViewer application, which became Google Earth following the company's acquisition by Google.

Phil holds an MBA from Southern Methodist University and a bachelor's degree in computer science from the University of Texas at Austin.

Presentations

Keynote with Phil Keslin Keynote

Phil Keslin, CTO & Founder, Niantic, Inc.

Dale Kim is the senior director of industry solutions at MapR. His background includes a variety of technical and management roles at information technology companies. While Dale’s experience includes work with relational databases, much of his career pertains to nonrelational data in the areas of search, content management, and NoSQL and includes senior roles in technical marketing, sales engineering, and support engineering. Dale holds an MBA from Santa Clara University, and a BA in computer science from the University of California, Berkeley.

Presentations

Architectural considerations for building big data applications in the cloud Session

Big data applications in the cloud are becoming more about the global distribution and access of data than about easier deployments. Dale Kim shares insights on architecting big data applications for the cloud, using an example reference application his team built and published as context for describing several key requirements for cloud-based environments.

Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.

Presentations

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating) Session

Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.
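
As a small taste of the model (a toy bounded pipeline with made-up events; an unbounded source would keep the same shape and add triggers), the Beam Python SDK expresses event-time windowing directly:

    import apache_beam as beam
    from apache_beam.transforms import window

    # (key, value, event-time seconds); in production this is unbounded.
    events = [("user1", 1, 10.0), ("user2", 3, 12.0), ("user1", 2, 70.0)]

    with beam.Pipeline() as p:
        (p
         | beam.Create(events)
         # Attach event-time timestamps so windows reflect when events
         # happened, not when they were processed.
         | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
         | beam.WindowInto(window.FixedWindows(60))  # one-minute windows
         | beam.CombinePerKey(sum)
         | beam.Map(print))

The same pipeline can then run on Spark, Flink, or Cloud Dataflow by swapping the runner.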

Mike Koelemay runs the Data Science team within Advanced Analytics at Sikorsky, where he is responsible for bringing state-of-the-art analytics and algorithm technologies to support the ingestion, processing, and serving of data collected onboard thousands of aerospace assets around the world. Drawing on his 10+ years of experience in applied data analytics for integrated system health management technologies, Mike works with other software engineers, data architects, and data scientists to support the execution of advanced algorithms, data mining, signal processing, system optimization, and advanced diagnostics and prognostics technologies, with a focus on rapidly generating information from large, complex datasets.

Presentations

Where data science meets rocket science: Data platforms and predictive analytics for aerospace Session

Sikorsky collects data onboard thousands of helicopters deployed worldwide that is used for fleet management services, engineering analyses, and business intelligence. Mike Koelemay offers an overview of the data platform that Sikorsky has built to manage the ingestion, processing, and serving of this data so that it can be used to rapidly generate information to drive decision making.

Daphne Koller is an Israeli-American professor in the Department of Computer Science at Stanford University and a MacArthur Fellowship recipient. She is also one of the founders of Coursera, an online education platform.

Presentations

Keynote with Daphne Koller Keynote

Daphne Koller, Professor in the Department of Computer Science at Stanford University

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Presentations

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker offer an introduction to using Impala + Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Anirudh Koul is a data scientist at Microsoft. Anirudh brings a decade of applied research experience on petabyte-scale social media datasets, including Facebook, Twitter, Yahoo Answers, Quora, Foursquare, and Bing. He has worked on a variety of machine-learning, natural language processing, and information retrieval-related projects at Yahoo, Microsoft, and Carnegie Mellon University. Adept at rapidly prototyping ideas, Anirudh has won over two dozen innovation, programming, and 24-hour hackathons organized by companies including Facebook, Google, Microsoft, IBM, and Yahoo. He was also the keynote speaker at the 2014 SMX conference in Munich, where he spoke about trends in applying machine learning on big data.

Presentations

Squeezing deep learning onto mobile phones Session

Over the last few years, convolutional neural networks (CNN) have risen in popularity, especially in computer vision. Anirudh Koul explains how to bring the power of deep learning to memory- and power-constrained devices like smartphones and drones.
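
One widely used technique in this space, offered here as an illustrative numpy sketch rather than a description of Anirudh's method, is post-training weight quantization: store weights as int8 with a per-tensor scale, trading a little precision for a 4x smaller model:

    import numpy as np

    def quantize_int8(w):
        # Linearly map float32 weights onto [-127, 127] with one scale
        # per tensor; memory drops 4x at some cost in precision.
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)
    q, scale = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, scale)).max())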

Jay Kreps is the cofounder and CEO of Confluent, a company focused on Apache Kafka. Previously, Jay was one of the primary architects for LinkedIn, where he focused on data infrastructure and data-driven products. He was among the original authors of a number of open source projects in the scalable data systems space, including Voldemort (a key-value store), Azkaban, Kafka (a distributed messaging system), and Samza (a stream processing system).

Presentations

The rise of real time: Apache Kafka and the streaming revolution Session

The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the union for stream processing, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Jay Kreps explores the future of Apache Kafka and the stream processing ecosystem.
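
For readers new to Kafka, the core abstraction is a durable, partitioned log: producers append records to a topic, and consumers read it independently at their own pace. A minimal sketch using the kafka-python client (broker address and topic name are placeholders):

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: append an event to the "page-views" topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("page-views", b'{"user": "u1", "url": "/home"}')
    producer.flush()

    # Consumer: replay the same log from the beginning.
    consumer = KafkaConsumer(
        "page-views",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
    )
    for record in consumer:
        print(record.offset, record.value)

Stream processors build on exactly this log abstraction, which is what makes the streaming architectures discussed in the session possible.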

Srini Kumar is a director of data science in the Algorithms and Data Science group at Microsoft, where he works with strategic customers in the areas of Cortana Analytics and Microsoft R Server. Prior to joining Microsoft, Srini headed product management for the Enterprise Information Management (EIM) product suite at SAP and originated and architected a product on HANA to analyze human genome variants, work that led to a discovery relating diabetes to a person's origin and resulted in two patent applications related to modeling genomic variants, plus a more recent one related to enterprise information management. He also helped turn around and sell a startup in the area of on-demand supply chain management software. Srini holds a master's degree in industrial engineering from the University of Wisconsin-Madison and a bachelor's degree in mechanical engineering from the Indian Institute of Technology, Chennai.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Scott Kurth is the vice president of advisory services at Silicon Valley Data Science, where he helps clients define and execute the strategies and data architectures that enable differentiated business growth. Building on 20 years of experience making emerging technologies relevant to enterprises, he has advised clients on the impact of technological change, typically working with CIOs, CTOs, and heads of business. Scott has helped clients drive global technology strategy, prioritize technology investments, shape alliance strategy based on technology, and build solutions for their businesses. Previously, Scott was director of the Data Insights R&D practice within Accenture Technology Labs, where he led a team focused on employing emerging technologies to discover the insight contained in data and bring that insight to bear on business processes, enabling new and better outcomes and even entirely new business models. He also led the creation of Technology Vision, Accenture's annual analysis of emerging technology trends impacting the future of IT, where he was responsible for tracking emerging technologies, analyzing their transformational potential, and using that analysis to influence technology strategy for both Accenture and its clients.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.

Dwai Lahiri is a longtime IT infrastructure guy. Dwai is a senior solutions architect at Cloudera, where he works with Cloudera's hardware and private and public cloud partners, enabling them to run Cloudera's enterprise data hub (EDH) stack on their respective platforms.

Presentations

How to leverage your private cloud infrastructure to deploy Hadoop Session

Dwai Lahiri explains how to leverage private cloud infrastructure to successfully build Hadoop clusters and outlines dos, don'ts, and gotchas for running Hadoop on private clouds.

Phil Langdale is a lead engineer at Cloudera working on Cloudera Manager. He has worked on every version of Cloudera Manager since its inception.

Presentations

Practical considerations for running Spark workloads in the cloud Session

Both Spark workloads and use of the public cloud have been rapidly gaining adoption in mainstream enterprises. Anand Iyer and Philip Langdale discuss new developments in Spark and provide an in-depth discussion on the intersection between the latest Spark and cloud technologies.

Brian Lange is a data scientist at Datascope, where he leads design process exercises and works on algorithms, web interfaces, and visualizations. Brian has contributed to projects for P&G, Thomson Reuters, Motorola, and other well-known companies, and his work has been featured on Nathan Yau’s FlowingData. While he’s not nerding out about typography and machine-learning techniques, Brian enjoys science and comedy podcasts, brewing beer, and listening to weird music.

Presentations

The perfect conference: Using stochastic optimization to bring people together Session

The goal of RCSA's Scialog conferences is to foster collaboration between scientists with different specialties and approaches, and for the last six years, working with Datascope, the organization has been doing so in a quantitative way. Brian Lange discusses how Datascope and RCSA arrived at the problem, the design choices made in the survey and optimization, and how the results were visualized.
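
The general shape of the approach can be sketched with a toy objective (simulated annealing over table assignments; the scoring function below is hypothetical, not Datascope's actual formulation):

    import math
    import random

    def anneal(people, n_tables, score, steps=20000, t0=1.0):
        # Randomly swap two people between tables; always keep improving
        # swaps, and keep worsening ones with a probability that shrinks
        # as the "temperature" cools, so the search escapes local optima.
        tables = [people[i::n_tables] for i in range(n_tables)]
        cur = score(tables)
        best, best_s = [list(t) for t in tables], cur
        for step in range(steps):
            t = max(t0 * (1 - step / steps), 1e-9)  # linear cooling
            a, b = random.sample(range(n_tables), 2)
            i = random.randrange(len(tables[a]))
            j = random.randrange(len(tables[b]))
            tables[a][i], tables[b][j] = tables[b][j], tables[a][i]
            new = score(tables)
            if new >= cur or random.random() < math.exp((new - cur) / t):
                cur = new
                if new > best_s:
                    best, best_s = [list(t) for t in tables], new
            else:
                tables[a][i], tables[b][j] = tables[b][j], tables[a][i]  # undo
        return best, best_s

    # Hypothetical objective: reward tables that mix specialties.
    people = [(f"p{i}", random.choice("ABC")) for i in range(24)]
    diversity = lambda ts: sum(len({s for _, s in t}) for t in ts)
    print(anneal(people, n_tables=4, score=diversity)[1])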

Once upon a time, Bill Lattner was a civil engineer. Now, he is a data scientist on the R&D team at Civis Analytics, where he spends most of his time writing tools for other data scientists, primarily in Python but also in R and occasionally Go. Prior to joining Civis, Bill was at Dishable, working on recommender systems and predicting the dining habits of Chicagoans.

Presentations

The power of persuasion modeling Session

How do we know that an advertisement or promotion truly drives incremental revenue? Michelangelo D'Agostino and Bill Lattner share their experience developing machine-learning techniques for predicting treatment responsiveness from randomized controlled experiments and explore the use of these “persuasion” models at scale in politics, social good, and marketing.
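
One common formulation, shown here as an illustrative sketch on synthetic data rather than the speakers' exact method, is the "two-model" uplift approach: fit separate response models on the treatment and control arms of a randomized experiment and score each individual by the difference, that is, by predicted persuadability rather than predicted response:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)

    # Synthetic randomized experiment: features X, random treatment T,
    # and an outcome whose response to treatment varies with x0.
    n = 10000
    X = rng.normal(size=(n, 3))
    T = rng.integers(0, 2, size=n)
    p = 1 / (1 + np.exp(-(0.2 * X[:, 1] + T * 0.8 * X[:, 0])))
    y = rng.random(n) < p

    # Fit one outcome model per arm; uplift is the difference in
    # predicted response with and without treatment.
    m_t = GradientBoostingClassifier().fit(X[T == 1], y[T == 1])
    m_c = GradientBoostingClassifier().fit(X[T == 0], y[T == 0])
    uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]

    # Target the most persuadable individuals, not the likeliest buyers.
    print("top uplift scores:", np.sort(uplift)[-3:])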

Julien Le Dem is a cocreator of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig. Julien is an architect at Dremio and was previously the tech lead for Twitter's data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

The future of column-oriented data processing with Arrow and Parquet Session

In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, such as RDMA, SSDs, and nonvolatile memory.

Mike Lee Williams is the director of research at Fast Forward Labs, an applied machine intelligence lab in New York City. He builds prototypes that bring the latest ideas in machine learning and AI to life and works with Fast Forward Labs' clients to help them understand how to make use of these new technologies. He holds a PhD in astrophysics from Oxford.

Presentations

Learning from incomplete, imperfect data with probabilistic programming Session

Real-world data is incomplete and imperfect. The right way to handle it is with Bayesian inference. Michael Williams demonstrates how probabilistic programming languages hide the gory details of this elegant but potentially tricky approach, making a powerful statistical method easy and enabling rapid iteration and new kinds of data-driven products.
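
As a minimal illustration of the style (using PyMC3, one such language; the model and data here are hypothetical), a few lines declare priors and a likelihood, and a single call runs the inference that would otherwise take pages of hand-written sampler code:

    import numpy as np
    import pymc3 as pm

    # A handful of noisy observations of some latent quantity.
    data = np.array([4.8, 5.1, 4.9, 5.3, 4.7, 5.0])

    with pm.Model():
        # Priors encode beliefs before seeing the data.
        mu = pm.Normal("mu", mu=0.0, sd=10.0)
        sigma = pm.HalfNormal("sigma", sd=5.0)
        # The likelihood ties the model to the observations.
        pm.Normal("obs", mu=mu, sd=sigma, observed=data)
        # The sampler's gory details are hidden behind one call.
        trace = pm.sample(1000, tune=1000, progressbar=False)

    print(trace["mu"].mean(), trace["mu"].std())

Missing values and small samples are handled in the same declarative way, which is what makes the approach attractive for imperfect real-world data.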

Bob Lehmann is an architect on the Data Platform team at Monsanto, where he leads efforts to both modernize enterprise technology and transition to the cloud. Bob has held a number of positions in IT and engineering working with data ranging from high-volume sensor data to enterprise data (and everything in between). He holds a master’s degree in electrical engineering from Missouri University of Science and Technology.

Presentations

Stream me up, Scotty: Transitioning to the cloud using a streaming data platform Session

Gwen Shapira and Bob Lehmann share their experience and patterns building a cross-data-center streaming data platform for Monsanto. Learn how to facilitate your move to the cloud while "keeping the lights on" for legacy applications. In addition to integrating private and cloud data centers, you'll discover how to establish a solid foundation for a transition from batch to stream processing.

Jure Leskovec is chief scientist at Pinterest and associate professor of computer science at Stanford University. Jure's research focuses on computation over massive data and has applications in computer science, social sciences, economics, marketing, and healthcare. This research has won several awards, including the Lagrange Prize, a Microsoft Research Faculty Fellowship, an Alfred P. Sloan Fellowship, and numerous best paper awards. Jure holds a bachelor's degree in computer science from the University of Ljubljana, Slovenia, and a PhD in machine learning from Carnegie Mellon University and undertook postdoctoral training at Cornell University.

Presentations

Recommending 1+ billion items to 100+ million users in real time: Harnessing the structure of the user-to-object graph to extract ranking signals at scale Session

Pinterest built a flexible, graph-based system for making recommendations to users in real time. The system uses random walks on a user-and-object graph in order to make personalized recommendations to 100+ million Pinterest users out of a catalog of over a billion items. Jure Leskovec explains how Pinterest built its modern recommendation engine and the lessons learned along the way.
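
The underlying idea can be sketched in a few lines: a random walk with restart over the user-and-object graph, counting item visits as a relatedness signal. This toy (purely illustrative, not Pinterest's production system) runs on a tiny bipartite graph:

    import random
    from collections import Counter, defaultdict

    # Bipartite graph: users connected to the items they saved.
    edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3"),
             ("u3", "i3"), ("u3", "i4"), ("u1", "i5"), ("u2", "i5")]
    adj = defaultdict(list)
    for u, i in edges:
        adj[u].append(i)
        adj[i].append(u)

    def recommend(user, n_steps=100000, restart=0.3, seed=42):
        # Walk randomly, teleporting back to the query user with
        # probability `restart`; frequently visited items are related.
        rng = random.Random(seed)
        visits = Counter()
        node = user
        for _ in range(n_steps):
            if rng.random() < restart:
                node = user
            node = rng.choice(adj[node])
            if node.startswith("i"):
                visits[node] += 1
        # Drop items the user already has; rank the rest by visit count.
        return [i for i, _ in visits.most_common() if i not in adj[user]]

    print(recommend("u1"))  # items reached through users with shared tastes

The restart probability keeps results personalized to the query user, while bounding the walk length keeps latency low enough for real-time serving.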

Haoyuan Li is founder and CEO of Alluxio (formerly Tachyon Nexus), a memory-speed virtual distributed storage system. Before founding the company, Haoyuan was working on his PhD at UC Berkeley’s AMPLab, where he cocreated Alluxio. He is also a founding committer of Apache Spark. Previously, he worked at Conviva and Google. Haoyuan holds an MS from Cornell University and a BS from Peking University.

Presentations

Alluxio (formerly Tachyon): The journey thus far and the road ahead Session

Alluxio (formerly Tachyon) is an open source memory-speed virtual distributed storage system. The project has experienced a tremendous improvement in performance and scalability and was extended with key new features. Haoyuan Li and Gene Pang explore Alluxio's goal of making its product accessible to an even wider set of users, through a focus on security, new language bindings, and APIs.

Yang Li is cofounder and CTO of Kyligence as well as a cocreator and PMC member of Apache Kylin. As the tech lead and architect of Kylin, Yang focuses on big data analysis, parallel computation, data indexing, relational algebra, approximation algorithms, and other technologies. Previously, he was a senior architect in eBay's Analytic Data Infrastructure department; the tech lead of IBM's InfoSphere BigInsights, where he was responsible for the Hadoop open source platform and won an Outstanding Technical Achievement award; and a vice president at Morgan Stanley, responsible for the global regulatory reporting platform.

Presentations

Apache Kylin 2.0: From classic OLAP to real-time data warehouse Session

Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.

Eric Liang is a software engineer at Databricks with broad interests in systems and data analytics. Before Databricks, Eric worked in the Storage Infrastructure group at Google. He holds a BS in electrical engineering and computer science from the University of California, Berkeley.

Presentations

How Spark can fail or be confusing and what you can do about it Session

Just like any six-year-old, Apache Spark does not always do its job and can be hard to understand. Michael Armbrust and Eric Liang look at the top causes of job failures customers encountered in production and examine ways to mitigate such problems by modifying Spark. They also share a methodology for improving resilience: a combination of monitoring and debugging techniques for users.
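
A classic member of that list (a generic illustration, not necessarily the speakers' taxonomy) is a job that fails on the driver even though every executor is healthy, because results were collected rather than aggregated:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("failure-patterns").getOrCreate()
    df = spark.range(100000000)  # stand-in for a large production table

    # Classic failure: collect() pulls every row to the driver, which
    # dies with an OutOfMemoryError long before the executors do.
    # rows = df.collect()  # <- don't do this on big data

    # Mitigations: aggregate on the cluster, sample, or fetch incrementally.
    print(df.count())                          # reduce, don't collect
    print(df.sample(False, 1e-6).collect())    # inspect a tiny sample
    for row in df.limit(5).toLocalIterator():  # bounded, incremental fetch
        print(row)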

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

Apache Kudu: 1.0 and beyond Session

Todd Lipcon offers a very brief refresher on the goals and feature set of the Kudu storage engine, covering the development that has taken place over the last year, including new features such as improved support for time series workloads, performance improvements, Spark integration, and highly available replicated masters.

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker offer an introduction to using Impala + Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Paige Liu is a software developer at Microsoft. Paige has been involved in the development of a wide range of applications and services, from web applications to large-scale multitier distributed systems to hyperscale search engine backends. While most of her experience is with Microsoft technology, Paige has also developed and released cross-platform solutions, including Java APM (application performance monitoring) and Linux/Unix monitoring systems. Recently, she has been focusing on cloud computing, specifically the Microsoft Azure cloud, helping enterprises develop new applications in the cloud or move their existing workloads to the cloud.

Presentations

Running a Cloudera cluster in production on Azure Session

Paige Liu explores the options and trade-offs to consider when building a Cloudera cluster on the Microsoft Azure cloud and explains how to deploy and scale a Cloudera cluster on Azure and how to connect it with other Azure services to build enterprise-grade, end-to-end big data solutions.

Julie Lockner is the cofounder of Chyber Corporation, a startup focused on improving care and education plans for children with special needs. Julie has held several executive positions at InterSystems, Informatica, and EMC and was an analyst with ESG. She holds an MBA from MIT and a BSEE from WPI.

Presentations

Individualized care driven by wearable data and real-time analytics Session

How can we empower individuals with special needs to reach their potential? Julie Lockner offers an overview of a project to develop collaboration applications that use wearable device data to improve the ability to develop the best possible care and education plans. Join in to learn how real-time IoT data analytics are making this possible.

Andre Luckow is a researcher, technology enthusiast, and project manager at BMW Group, currently living in Greenville, South Carolina. His interests range across technology and programming, along with travel and innovation.

Presentations

Deep learning in the automotive industry: Applications and tools Tutorial

Andre Luckow shares best practices for developing and deploying deep learning solutions in the automotive industry and explores different deep learning frameworks, including TensorFlow, Caffe, and Torch, and deep neural network architectures, evaluating their trade-offs in terms of classification performance, training, and inference time.

Maura Lynch is a product manager at Pinterest. Before moving into product, she spent several years in analytics, both at Pinterest and in gaming. Maura started her career in research, in physics at Duke and in economics at the Federal Reserve.

Presentations

New user recommendations at scale: Identifying compelling content for low-signal users using a hybrid-curation approach Tutorial

New users are the most delicate audience for any service; nailing their first experience with your product is essential to growing your user base. Maura Lynch offers an overview of Pinterest's hybrid-curation approach to creating compelling content streams for users when there is very little signal as to their preferences.

Roger Magoulas is the research director at O'Reilly Media and chair of the Strata + Hadoop World conferences. Roger and his team build the analysis infrastructure and provide analytic services and insights on technology-adoption trends to business decision makers at O'Reilly and beyond. He and his team find what excites key innovators and use those insights to gather and analyze faint signals from various sources to make sense of what others may adopt and why.

Presentations

Thursday Opening Welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday Opening Welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Rajiv Maheswaran is CEO of Second Spectrum, an innovative sports analytics and data visualization startup located in Los Angeles, California. His work spans the fields of data analytics, data visualization, real-time interaction, spatiotemporal pattern recognition, artificial intelligence, decision theory, and game theory. Previously, Rajiv served as a research assistant professor within the University of Southern California’s Department of Computer Science and a project leader at the Information Sciences Institute at the USC Viterbi School of Engineering. He and Second Spectrum COO Yu-Han Chang codirected the Computational Behavior Group at USC. Rajiv has received numerous awards and written over 100 publications in artificial intelligence and control theory. Rajiv won the 2014 USC Viterbi School of Engineering Use-Inspired Research Award as well as both the 2014 and 2012 Best Research Paper (Alpha Award) at the renowned MIT Sloan Sports Analytics Conference. He is a frequent speaker at marquee technology conferences and events around the world. Rajiv holds a BS in applied mathematics, engineering, and physics from the University of Wisconsin-Madison and an MS and PhD, both in electrical and computer engineering, from the University of Illinois at Urbana-Champaign.

Presentations

Keynote with Rajiv Maheswaran Keynote

Details to come.

Ted Malaska has 18 years of professional experience working for startups, the US government, some of the world's largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the largest nonprofit financial regulator in the US and has worked on close to one hundred clusters for over two dozen clients, spanning hundreds of use cases. He has architecture experience across topics including Hadoop, Web 2.0, mobile, SOA (ESB, BPM), and big data. Ted is a regular contributor to the Hadoop, HBase, and Spark projects, a regular committer to Flume, Avro, Pig, and YARN, and the coauthor of O'Reilly Media's Hadoop Application Architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

James Malone is a product manager for Google Cloud Platform and manages Cloud Dataproc and Apache Beam (incubating). Previously, James worked at Disney and Amazon. James is a big fan of open source software because it shows what is possible when people come together to solve common problems with technology. He also loves data, amateur radio, Disneyland, photography, running, and Legos.

Presentations

Architecting and building enterprise-class Spark and Hadoop in cloud environments Tutorial

James Malone explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.

Mike Mansbach is the president of BlueJeans Network. As an industry veteran, Mike has a demonstrated track record of driving results, delivering exceptional customer experiences, and preparing companies for meteoric growth. Previously, he was CEO of PunchTab (acquired by Walmart in 2015) and worked at Citrix, where he was instrumental in growing Citrix Systems’ GoTo product line, including GoToMeeting. Under his leadership, Citrix focused on delivering great customer value and SaaS revenues grew from $40M to over $500M.

Presentations

Inside predictive intelligence, the powerful technology disrupting sales and marketing Session

What if companies could predict what products people will buy, how much they will buy, and when? With predictive intelligence, they can. Amanda Kahlow and Mike Mansbach dive into the technology behind how BlueJeans Network was able to leverage predictive analytics to uncover buyers earlier, convert them at a 20x higher rate, and build a $33M pipeline.

Kevin Mao is a senior data engineer at Capital One Financial Services currently working on the Cybersecurity Data Lake team within Capital One’s Enterprise Data Services organization. Kevin’s current work involves designing and developing tools to ingest and transform cybersecurity-related data streams from across the organization into datasets that are used by security analysts for detecting and forecasting cyberthreats. Kevin holds a BS in computer science from the University of Maryland, Baltimore County, and an MS in computer science from George Mason University. In his free time, he enjoys hiking, running, climbing, and snowboarding.

Presentations

Achieving real-time ingestion and analysis of security events through Kafka and Metron Session

Kevin Mao explores the value of and challenges associated with collecting raw security event data from disparate corners of enterprise infrastructure and transforming them into high-quality intelligence that can be used to forecast, detect, and mitigate cybersecurity threats.

Manish Marwah is a senior researcher at Hewlett Packard Labs. His research interests include data mining, machine learning, big data, energy analytics, computational sustainability, cyber-physical systems, the internet of things, visual analytics, data centers, and the dependability of systems. Currently, Manish is focused on large-scale graph analytics for the IoT. Previously, he worked for several years in the telecommunications industry, most recently as an architect with Avaya Labs. His research has led to more than 60 refereed papers in conferences and journals and resulted in 23 issued patents. Manish holds a PhD in computer science from the University of Colorado at Boulder.

Presentations

Malicious site detection with large-scale belief propagation Session

Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP) for Apache Spark, applying it to large web-crawl data to infer the probability that websites are malicious. Applications of BP include fraud detection, malware detection, computer vision, and customer retention.
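
To make the algorithm concrete, here is a toy sum-product loopy BP on a four-node "web graph" in plain numpy (illustrative only; the session's implementation runs on Spark against web-crawl-scale graphs):

    import numpy as np

    # States: 0 = benign, 1 = malicious. Node 0 is a known-bad seed,
    # node 3 a known-good seed; edges encourage neighbors to agree.
    edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
    unary = {0: np.array([0.1, 0.9]), 3: np.array([0.9, 0.1])}
    psi = np.array([[0.8, 0.2],
                    [0.2, 0.8]])  # homophily edge potential
    n = 4

    neighbors = {i: [] for i in range(n)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # msgs[(i, j)] is node i's current message about node j's states.
    msgs = {(i, j): np.ones(2) for a, b in edges for (i, j) in [(a, b), (b, a)]}

    for _ in range(20):  # iterate message passing to (rough) convergence
        new = {}
        for (i, j) in msgs:
            phi = unary.get(i, np.ones(2))
            incoming = np.prod([msgs[(k, i)] for k in neighbors[i] if k != j],
                               axis=0)
            m = psi.T @ (phi * incoming)
            new[(i, j)] = m / m.sum()  # normalize for numerical stability
        msgs = new

    for i in range(n):
        b = unary.get(i, np.ones(2)) * np.prod(
            [msgs[(k, i)] for k in neighbors[i]], axis=0)
        print(f"node {i}: P(malicious) = {b[1] / b.sum():.2f}")

The Spark version distributes exactly these per-edge message updates across the cluster, iterating until beliefs stabilize.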

Desi Matel-Anderson is the chief wrangler of the Field Innovation Team (FIT) and CEO of the Global Disaster Innovation Group, LLC. FIT has deployed teams to several disasters including the Boston Marathon bombings, assisting at the scene with social media analysis; the Moore, Oklahoma, tornadoes, leading coding solutions; Typhoon Haiyan in the Philippines, through the building of cellular connectivity heat maps; and the Oso, Washington, mudslides, with unmanned aerial system flights, which resulted in a 3D print of the topography for incident command. The team also deploys to humanitarian crises, which have included running a robot petting zoo at the US/Mexico border and leading a women’s empowerment recovery movement after the Nepal earthquakes. Recently, her team deployed to Lebanon for the Syrian refugee crisis, supporting artificial intelligence for access to health care, establishing the power grid, and empowering refugees through evacuation routes utilizing 360-degree virtual reality capture video.

Previously, Desi was the first chief innovation advisor at FEMA, where she led the innovation team to areas affected by Hurricane Sandy to provide real-time problem solving in disaster response and recovery and ran think tanks nationwide to cultivate innovation in communities. Desi’s emergency management experience began when she volunteered in Northern Illinois University’s Office of Emergency Planning. She then worked with the Southeast Wisconsin Urban Area Security Initiative and the City of Milwaukee Office of Emergency Management. In addition to her regional emergency management duties, she worked as a nationwide assessor of the Emergency Management Accreditation Program. Desi lectures on innovation at Harvard, Yale, UC Berkeley, and several other universities across the country and serves as consultant on innovative practices and infrastructure for agencies and governments, nationally and internationally. Desi attended the National Preparedness Leadership Institute at Harvard’s Kennedy School of Government and School of Public Health and served on the advisory board of Harvard’s National Preparedness Leadership Institute in 2013. She holds a Juris Doctorate from Northern Illinois University.

Presentations

Data in disasters: Saving lives and innovating in real time Keynote

Data to the rescue. Desi Matel-Anderson offers an immersive deep dive into the world of the Field Innovation Team, who routinely find themselves on the frontier of disasters working closely with data to save lives, at times while risking their own.

Dr. Mendez-Costabel holds an agronomy degree (BSc) from the National University of Uruguay and two viticulture degrees, an MSc from the University of California, Davis, and a PhD from the University of Adelaide in Australia. He has worked for over twelve years in the agricultural sector, covering a wide range of precision agriculture-related roles, including data scientist and GIS manager for E&J Gallo Winery in California. Currently, Dr. Mendez-Costabel leads the geospatial data asset team in Monsanto's Products and Engineering organization within the IT department, where he drives the engineering and adoption of global geospatial data assets for the enterprise.

Presentations

The enterprise geospatial platform: A perfect fusion of cloud and open source technologies Session

The need to process geospatial data has grown in leaps and bounds at Monsanto. Recently, the volume of data collected from farmers' fields via sensors, rovers, drones, in-cabin technologies, and other sources has forced Monsanto to rethink its geospatial processing capabilities. Naghman Waheed explains how Monsanto built a scalable geospatial platform using cloud and open source technologies.

Stephen Merity is a senior software engineer at MetaMind (recently acquired by Salesforce), where he works on researching and implementing deep learning models for vision and text, with a focus on memory networks and neural attention mechanisms for computer vision and natural language processing tasks. Previously, Stephen worked on big data at Common Crawl, data analytics at Freelancer.com, and online education at Grok Learning. Stephen holds a master’s degree in computational science and engineering from Harvard University and a bachelor of information technology from the University of Sydney.

Presentations

The frontiers of attention and memory in neural networks Session

While attention and memory have become important components in many state-of-the-art deep learning architectures, it's not always obvious where they may be most useful. Even more challenging, such models can be very computationally intensive for production. Stephen Merity discusses the most recent techniques, what tasks they show the most promise in, and when they make sense in production systems.
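
At the core of most attention mechanisms sits one small computation: each query takes a weighted average of stored values, with weights from a softmax over query-key similarity. A minimal numpy sketch of scaled dot-product attention:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # The softmax weights say "where to look" in memory (K, V);
        # scaling by sqrt(d) keeps the logits in a stable range.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        return softmax(scores) @ V

    K = np.random.randn(3, 4)  # 3 memory slots of dimension 4
    V = np.random.randn(3, 4)
    Q = np.random.randn(2, 4)  # 2 queries
    print(attention(Q, K, V).shape)  # (2, 4)

The cost is a full query-by-memory score matrix, which is exactly the kind of production-time expense the session examines.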

Michael Minar is a principal software engineer at Trifacta, where he develops solutions to bring intelligence to the data preparation process. His main area of focus at Trifacta is machine learning for visual data transformation and discovery. Previously, Michael was the director of analytics at Euclid Analytics. He holds a PhD in applied physics from Stanford University.

Presentations

Intelligent pattern profiling on semistructured data with machine learning Session

It's well known that data analysts spend 80% of their time preparing data and only 20% analyzing. In order to change that ratio, organizations must build tools specifically designed for working with ad hoc (semistructured) data. Sean Kandel and Michael Minar discuss the development of a new technique leveraging machine learning to discover and profile the inherent structure in ad hoc datasets.

Mostafa Mokhtar is a performance engineer at Cloudera. Previously, he held similar roles at Hortonworks and on the SQL Server team at Microsoft.

Presentations

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Rajat Monga leads TensorFlow, an open source machine-learning library and the center of Google's efforts at scaling up deep learning. He is one of the founding members of the Google Brain team and is interested in pushing machine-learning research forward toward general AI. Prior to Google, as the chief architect and director of engineering at Attributor, Rajat led the labs and operations and built out the engineering team. A veteran developer, Rajat has worked at eBay, Infosys, and a number of startups.

Presentations

The state of TensorFlow today and where it is headed in 2017 Session

Rajat Monga offers an overview of TensorFlow progress and adoption in 2016 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas.
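
For readers new to TensorFlow, the current API separates graph construction from execution, so the same graph can run on CPUs, GPUs, or TPUs. A minimal linear-regression sketch in the 1.x style (toy data):

    import tensorflow as tf

    # Build the graph: placeholders for data, variables for parameters.
    x = tf.placeholder(tf.float32, shape=[None])
    y = tf.placeholder(tf.float32, shape=[None])
    w = tf.Variable(0.0)
    b = tf.Variable(0.0)
    loss = tf.reduce_mean(tf.square(w * x + b - y))
    train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    # Execute the graph in a session.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(200):
            sess.run(train, feed_dict={x: [1, 2, 3, 4], y: [2, 4, 6, 8]})
        print(sess.run([w, b]))  # w should approach 2.0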

Gleicon Moraes is delivery and operations manager at 7co.cc. He uses Python, Go, and Erlang and loves distributed systems, nonrelational databases, and OSS.

Presentations

Building a recommender from a big behavior graph over Cassandra Session

Gleicon Moraes and Arthur Grava share war stories about developing and deploying a cloud-based large-scale recommender system for a top-three Brazilian ecommerce company. The system, which used Cassandra and graph traversal, led to a more than 15% increase in sales.

Todd Mostak is the founder and CEO of MapD, a pioneer in building GPU-tuned analytics and visualization applications for the enterprise. Previously, Todd was a research fellow at MIT's Computer Science and Artificial Intelligence Laboratory, where he focused on GPU databases and visualization. Todd conceived of the idea of using GPUs to accelerate the extraction of insights from large datasets while conducting graduate research on the role of Twitter in the Arab Spring. Frustrated by the inability of conventional technologies to support interactive exploration of these multimillion-row datasets, Todd built one of the first GPU-based databases. Todd holds an MA in Middle Eastern studies from Harvard.

Presentations

From hours to milliseconds: How Verizon accelerated its mobile analytics Session

With more than 91M customers, Verizon produces oceans of data. The challenge this onslaught presents isn’t one of storage—it’s one of speed. The solution? Harnessing the power of GPUs to access insights in less than a millisecond. Todd Mostak and Abdul Subhan explain how Verizon solved its data challenge by implementing GPU-tuned analytics and visualization.

John Mount is a principal consultant at Win-Vector LLC, a San Francisco data science consultancy. John has worked as a computational scientist in biotechnology and a stock-trading algorithm designer and has managed a research team for Shopping.com (now an eBay company). John is the coauthor of Practical Data Science with R (Manning Publications, 2014). John started his advanced education in mathematics at UC Berkeley and holds a PhD in computer science from Carnegie Mellon (specializing in the design and analysis of randomized algorithms). He currently blogs about technical issues at the Win-Vector blog, tweets at @WinVectorLLC, and is active in the Rotary. Please contact jmount@win-vector.com for projects and collaborations.

Presentations

Modeling big data with R, sparklyr, and Apache Spark Tutorial

Sparklyr provides an R interface to Spark. With sparklyr, you can manipulate Spark datasets to bring them into R for analysis and visualization and use sparklyr to orchestrate distributed machine learning in Spark from R with the Spark MLlib and H2O Sparkling Water libraries. Sean Lopp, John Mount, and Garrett Grolemund demonstrate how to use sparklyr to analyze big data in Spark.

Presentations

Building the “future you” retirement planning service on a Hadoop data lake Tutorial

Chris Murphy and Emma Jones explain how leading global insurer Zurich adopted Hadoop to leverage its data to underpin the new customer-centric ethos of the organization and how this has enabled a new approach that helps customers truly understand their financial portfolios, build a roadmap to meet their financial goals, identify opportunities, and secure their financial future.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).

Presentations

The future of column-oriented data processing with Arrow and Parquet Session

In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, such as RDMA, SSDs, and nonvolatile memory.

Presentations

Modeling big data with R, sparklyr, and Apache Spark Tutorial

Sparklyr provides an R interface to Spark. With sparklyr, you can manipulate Spark datasets to bring them into R for analysis and visualization and use sparklyr to orchestrate distributed machine learning in Spark from R with the Spark MLlib and H2O Sparkling Water libraries. Sean Lopp, John Mount, and Garrett Grolemund demonstrate how to use sparklyr to analyze big data in Spark.

Jack Norris is the senior vice president of data and applications at MapR Technologies. In his 20 years in enterprise software marketing, Jack has demonstrated a wide range of successes, from defining new markets for small companies to increasing sales of new products for large public companies. Jack's broad experience includes launching and establishing analytics, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain & Company. Jack has an MBA from UCLA's Anderson School of Management and a BA in economics with honors and distinction from Stanford University.

Presentations

The main event: Identifying and exploiting the keys to digital transformation Session

Leading companies are integrating operations and analytics to make real-time adjustments to improve revenues, reduce costs, and mitigate risks. There are many aspects to digital transformation, but the timely delivery of actionable data is both a key enabler and an obstacle. Jack Norris explores how companies from TransUnion to Uber use event-driven processing to transform their businesses.

A leading expert on big data architecture and Hadoop, Stephen O'Sullivan has 20 years of experience creating scalable, high-availability data and application solutions. A veteran of WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

Avinash Padmanabhan is a staff quality engineer in Intuit's Small Business Data and Analytics group, where he focuses on ensuring quality of the data pipeline that enables the work of analysts and business stakeholders. Avinash has over 12 years of experience specializing in building frameworks and solutions that solve challenging quality problems and delight customers. He holds a master's degree in electrical and computer engineering from the State University of New York.

Presentations

Shifting left for continuous quality in an Agile data world Session

Data warehouses are critical in driving business decisions—with SQL dominantly used to build ETL pipelines. While the technology has shifted from using RDBMS-centric data warehouses to data pipelines based on Hadoop and MPP databases, engineering and quality processes have not kept pace. Avinash Padmanabhan highlights the changes that Intuit's team made to improve processes and data quality.
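
A concrete flavor of shifting left (a hypothetical pandas-based check, not Intuit's actual framework) is validating pipeline output in CI the same way unit tests validate code, before anything reaches production:

    import pandas as pd

    def check_output(df):
        # Assertions encode the data contract for a pipeline's output.
        assert len(df) > 0, "pipeline produced no rows"
        assert df["order_id"].is_unique, "duplicate order_ids"
        assert df["amount"].notna().all(), "null amounts"
        assert (df["amount"] >= 0).all(), "negative amounts"

    # Run against a small fixture on every commit, long before deploy.
    fixture = pd.DataFrame({"order_id": [1, 2, 3],
                            "amount": [9.99, 0.0, 12.5]})
    check_output(fixture)
    print("data-quality checks passed")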

Shoumik Palkar is a second-year PhD student in the Infolab at Stanford University, working with Matei Zaharia on high-performance data analytics. He holds a degree in electrical engineering and computer science from UC Berkeley.

Presentations

Weld: An optimizing runtime for high-performance data analytics Session

Modern data applications combine functions from many libraries and frameworks and cannot achieve peak hardware performance due to data movement across functions. Shoumik Palkar offers an overview of Weld, an optimizing runtime that enables optimizations across disjoint libraries, and explains how to integrate it into frameworks such as Spark SQL for performance gains with no changes to user code.

Gene Pang is a software engineer at Alluxio. Previously, he worked at Google. Gene recently earned his PhD from the AMPLab at UC Berkeley, working on distributed database systems, and holds an MS from Stanford University and a BS from Cornell University.

Presentations

Alluxio (formerly Tachyon): The journey thus far and the road ahead Session

Alluxio (formerly Tachyon) is an open source memory-speed virtual distributed storage system. The project has experienced a tremendous improvement in performance and scalability and was extended with key new features. Haoyuan Li and Gene Pang explore Alluxio's goal of making its product accessible to an even wider set of users, through a focus on security, new language bindings, and APIs.

Kishore Papineni is the director of information strategy and management, RWI, and analytics at Astellas Pharma.

Presentations

Governance and metadata management of Cigna's enterprise data lake Tutorial

Launched in late 2015, Cigna's enterprise data lake project is taking the company on a data governance journey. Kishore Papineni offers an overview of the project, providing insights into some of the business pain points and key drivers, how it has led to organizational change, and the best practices associated with Cigna’s new data governance process.

Kartik Paramasivam is a senior software engineering leader at LinkedIn. Kartik specializes in cloud computing, distributed systems, enterprise and cloud messaging, stream processing, the internet of things, web services, middleware platforms, application hosting, and enterprise application integration (EAI). He has authored a number of patents. Kartik holds a bachelor of engineering from the Maharaja Sayajirao University of Baroda and an MS in computer science from Clemson University.

Presentations

Processing millions of events per second without breaking the bank Session

LinkedIn has one of the largest Kafka installations in the world, ingesting more than a trillion messages per day. Apache Samza-based stream-processing applications process this deluge of data. Kartik Paramasivam discusses key improvements and architectural patterns that LinkedIn has adopted in its data systems in order to process millions of requests per second while keeping costs in control.

Nixon Patel is founder, CEO, and MD of analytics startup Kovid. Nixon is a visionary leader, an exemplary technocrat, and a successful, business-oriented entrepreneur with a proven track record for growing six businesses from startups to large, global technology companies with millions in annual sales—in industries ranging from big data technology, analytics, the cloud, the IoT, speech recognition, and machine learning to renewable energy, information technology, telecommunications, and pharmaceuticals—all over a short 26-year span. Previously, he was chief data scientist at Brillio, a sister company of Collabera, where he was instrumental in starting the big data analytics, cloud, and IoT practices and establishing centers of excellence and co-innovation labs. He is also an independent director in VivMed Labs and Tripborn. Nixon holds a BT with honors in chemical engineering from IIT Kharagpur, an MS in computer science from the New Jersey Institute of Technology, and a data science specialization from Johns Hopkins University. He is currently pursuing a second master’s degree in business and science in analytics from Rutgers University.

Presentations

Real-time analysis of behavior of law enforcement encounters using big data analytics and deep learning multimodal emotion-recognition models Tutorial

Being able to monitor the emotional state of police officers over a period of time would enable the officers’ supervisors to intervene if a given officer is subject to repeated emotional stress. Nixon Patel presents deep learning and AI models that capture and analyze the emotional state of law enforcement officers over a period of time.

Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh holds a master of computer science from the University of Tennessee at Chattanooga, where he did research in mesh networks and social insect swarm algorithms. Josh is a cofounder of the DL4J open source deep learning project and a coauthor of the upcoming O'Reilly title Deep Learning: A Practitioner's Approach. He has over 15 years' experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif.

Presentations

Scalable deep learning for the enterprise with DL4J Tutorial

Dave Kale, Melanie Warwick, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets.

Vanja Paunić is a data scientist on the Azure Machine Learning team at Microsoft. Previously, Vanja worked as a research scientist in the field of bioinformatics, where she published on uncertainty in genetic data, genetic admixture, and gene prediction. She holds a PhD in computer science with a focus on data mining from the University of Minnesota.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Frances Perry is a software engineer who likes to make big data processing easy, intuitive, and efficient. After many years working on Google’s internal data processing stack, Frances joined the Cloud Dataflow team to make this technology available to external cloud customers. She led the early work on Dataflow’s unified batch/streaming programming model and is on the PMC for Apache Beam.

Presentations

Learn stream processing with Apache Beam Tutorial

Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau, Frances Perry, and Jesse Anderson cover the basics of robust stream processing with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow.

Ganesh Prabhu is a staff software engineer at FireEye with 20+ years of RDBMS and engineering experience.

Presentations

FireEye's journey migrating 25 TB of RDBMS data to Hadoop Session

Ganesh Prabhu and Vivek Agate present an approach that enabled a small team at FireEye to migrate 20 TB of RDBMS data, comprising over 250 tables and nearly 2,000 partitions, to Hadoop, and present an adaptive platform that allows migration of a rapidly changing dataset to Hive. Along the way, they share some of the challenges typical for a company embarking on a Hadoop implementation.

Ryan Pridgeon is a customer operations engineer at Confluent. Ryan has a deep-rooted passion for tinkering that knows no bounds. Be it automotive, software, or carpentry, if it has pieces, Ryan wants to take it apart. He’s still working on putting things back together though.

Presentations

Mistakes were made, but not by us: Lessons from a year of supporting Apache Kafka Session

Dustin Cote and Ryan Pridgeon share their experience troubleshooting Apache Kafka in production environments and discuss how to avoid pitfalls like message loss or performance degradation in your environment.

Kishore Reddipalli is a senior technical architect at GE Digital, leading the product architecture for Predix product operations optimization. Kishore has been with GE for nearly eight years, working across various industrial business domains, including oil and gas, transportation, power, and renewables. He is an expert in building big data applications. Prior to joining GE Digital, Kishore worked at GE Healthcare, where he helped build the next-generation EMR platform Qualibria.

Presentations

Optimizing industrial operations in real time using the big data ecosystem Session

Kishore Reddipalli explores how to stream data at large scale from the edge to the cloud to the client, detect anomalies, analyze machine data in stream and at rest in an industrial world, and optimize industrial operations by providing real-time insights and recommendations using big data technologies.

Prasanna Rajaperumal is a senior engineer at Uber, working on building the next generation of Uber's data infrastructure. At Uber, he has been building data systems that scale along with Uber's hypergrowth. Over the last six months, he has been focusing on building a library that ingests change logs into large HDFS datasets, optimized for analytical workloads.

Over the last 12 years, he has held various roles building data systems at companies small and large. Prior to Uber, he was a software engineer at Cloudera, building out data infrastructure for indexing and visualizing customer log files.

Presentations

Incremental processing on Hadoop at Uber Session

To fulfill its mission, Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Vinoth Chandar explores data processing systems for near-real-time use cases, making the case that adding new incremental processing primitives to existing Hadoop technologies can solve many problems at reduced cost and in a unified manner.

Karthik Ramasamy is the engineering manager and technical lead for real-time analytics at Twitter. Karthik is the cocreator of Heron and has more than two decades of experience working in parallel databases, big data infrastructure, and networking. He cofounded Locomatix, a company that specializes in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Before Locomatix, he had a brief stint with Greenplum, where he worked on parallel query scheduling. Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases, and high-availability solutions for network routers that are widely deployed in the Internet. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik has a PhD in computer science from UW Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Anomaly detection in real-time data streams using Heron Session

Anomaly detection plays a key role in the analysis of real-time streams; this is exemplified by, say, detecting real-life incidents from tweet storms. Arun Kejariwal and Karthik Ramasamy walk you through how anomaly detection is supported in real-time data streams in Heron—the streaming system built in-house at Twitter (and open sourced) for real-time computation.

Delip Rao is the founder of Joostware, a San Francisco-based company specializing in consulting and building IP in natural language processing and deep learning. Delip is a well-cited researcher in natural language processing and machine learning who has worked at Google Research, Twitter, and Amazon (Echo) on various NLP problems. His interests are in building cost-effective, state-of-the-art AI solutions that scale well. He also has a book on NLP and deep learning forthcoming from O’Reilly Media.

Presentations

Deep learning for natural language understanding Session

Deep learning has revolutionized AI and natural language processing in particular. Delip Rao walks you through some of the recent breakthroughs in NLP and the deep learning technologies powering them and discusses some of the challenges in building and deploying such systems.

Fred Reiss is chief architect and one of the founding employees of the IBM Spark Technology Center in San Francisco. Previously, Fred worked for IBM Research Almaden for nine years, where he worked on the SystemML and SystemT projects as well as on the research prototype of DB2 with BLU Acceleration. He has over 25 peer-reviewed publications and six patents. Fred holds a PhD from UC Berkeley.

Presentations

Compressed linear algebra in Apache SystemML Session

Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in the main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.
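
The core idea is easy to see in miniature: if a column of the training matrix is stored as run-length-encoded runs, key operations such as dot products can be evaluated directly on the runs instead of on decompressed values. The toy Python sketch below illustrates only that general idea; SystemML's actual scheme (column co-coding and multiple encoding formats) is more sophisticated.

```python
# A toy illustration of operating on compressed data without decompressing:
# the dot product of a run-length-encoded (RLE) column with a dense vector.
# A sketch of the general idea only, not SystemML's actual implementation.

def rle_dot(runs, v):
    """Dot product of an RLE-compressed column with a dense vector v.

    runs: list of (value, length) pairs covering the column top to bottom.
    """
    total, offset = 0.0, 0
    for value, length in runs:
        if value != 0.0:  # runs of zeros contribute nothing
            total += value * sum(v[offset:offset + length])
        offset += length
    return total

# The column [3, 3, 3, 0, 0, 5] compresses to three runs:
runs = [(3.0, 3), (0.0, 2), (5.0, 1)]
v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rle_dot(runs, v))  # 3*(1+2+3) + 5*6 = 48.0
```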

Leveraging deep learning to predict breast cancer proliferation scores with Apache Spark and Apache SystemML Session

Estimating the growth rate of tumors is a very important but very expensive and time-consuming part of diagnosing and treating breast cancer. Michael Dusenberry and Frederick Reiss describe how to use deep learning with Apache Spark and Apache SystemML to automate this critical image classification task.

Andreas Ribbrock is principal data scientist at #zeroG, a Lufthansa Systems company, where he is building a big data architecture and data science team. Previously, Andreas was the head of product management and data scientist for Cupertino- and Cologne-based IoT database startup ParStream and the team leader for the data science practice of analytics heavyweight Teradata, where he delivered leading-edge data science solutions in various industries for global players like DHL Express, Deutsche Post, Lufthansa, Otto Group, Siemens, Deutsche Telekom, Vodafone Australia, and Volkswagen. He has presented at international conferences on topics related to big data architectures, data science, and data warehousing. Andreas holds a PhD in computer science from Bonn University. His fields of study included signal processing, data structures, and algorithms for content-based retrieval in nonrelationally structured data like images, audio signals, and 3D protein molecules.

Presentations

How Lufthansa German Airlines is using data analytics to create the next level of customer experience Tutorial

The aviation industry is facing huge cost pressure as well as profound disruption in marketing and service. With ticket revenues dropping, increasing customer loyalty is key. Andreas Ribbrock explains how Lufthansa German Airlines uses data science and data-driven decision making to create the next level of digital customer experience along the full customer journey.

Eric Richardson is a senior engineer at the American Chemical Society. He enjoys late-night coding sessions and long walks through architectures. Eric believes in architectures built on sound principles. He likes a tall, frosty glass of collaboration and information sharing and knows that people make better decisions when they know the landscape. When the time is right you must free your architecture, and let it find its own path—it will exceed your wildest hopes…and fears.

Presentations

Architecting an enterprise data hub in a 110-year-old company Session

Eric Richardson explains how CAS used Hadoop, HBase, Spark, Kafka, and Solr to create a hybrid cloud enterprise data hub that scales without drama and drives adoption by ease of use, covering the architecture, technologies used, the challenges faced and defeated, and problems yet to solve.

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.

Presentations

BI and SQL analytics with Hadoop in the cloud Session

Henry Robinson and Justin Erickson explain how to best take advantage of the flexibility and cost-effectiveness of the cloud with your BI and SQL analytic workloads using Apache Hadoop and Apache Impala (incubating) to provide the same great functionality, partner ecosystem, and flexibility of on-premises deployments.

Alan Ross is a senior principal engineer and chief cloud security architect at Intel. Alan has more than 20 years of information security experience in various capacities, from policy and awareness and security/risk analysis to engineering and architecture. Previously, Alan worked as a security administrator and engineer for two global companies, focusing on network, host, and application security. He has 21 US patents and many others pending relating to security and manageability of systems and networks. Alan is currently leading activities around Open Network Insight, an open source project for advanced analytics of network telemetry.

Presentations

Paint the landscape and secure your data center with Apache Spot Session

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture through unsupervised machine learning at cloud scale for anomaly detection.

Presentations

Modeling big data with R, sparklyr, and Apache Spark Tutorial

Sparklyr provides an R interface to Spark. With sparklyr, you can manipulate Spark datasets to bring them into R for analysis and visualization and use sparklyr to orchestrate distributed machine learning in Spark from R with the Spark MLlib and H2O Sparkling Water libraries. Sean Lopp, John Mount, and Garrett Grolemund demonstrate how to use sparklyr to analyze big data in Spark.

Sparklyr: An R interface for Apache Spark Session

Sparklyr makes it easy and practical to analyze big data with R—you can filter and aggregate Spark DataFrames to bring data into R for analysis and visualization and use R to orchestrate distributed machine learning in Spark using Spark ML and H2O Sparkling Water. Garrett Grolemund walks through these features and demonstrates how to use sparklyr to create R functions that access the full Spark API.
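
For readers who think in Spark's DataFrame API rather than in dplyr verbs, the sketch below shows a roughly equivalent filter-and-aggregate pipeline in PySpark; the flights dataset and its column names are invented for illustration and are not material from the session.

```python
# Rough PySpark analogue of a sparklyr/dplyr pipeline such as
# flights %>% filter(dep_delay > 0) %>% group_by(carrier) %>%
#   summarise(mean_delay = mean(dep_delay)).
# The dataset path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sparklyr-analogue").getOrCreate()

flights = spark.read.csv("flights.csv", header=True, inferSchema=True)

summary = (flights
           .filter(F.col("dep_delay") > 0)                 # filter()
           .groupBy("carrier")                             # group_by()
           .agg(F.avg("dep_delay").alias("mean_delay")))   # summarise()

summary.show()  # the aggregate is small, so it is safe to bring back
```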

Jiphun Satapathy is a senior security architect at Visa, where he leads the security architecture of Visa’s digital and mobile products like Visa Token Service, Visa Checkout, and Visa Direct, which are used by millions of users. Jiphun’s areas of expertise include application security, data security, and cloud security. Previously, he was a software architect at Intel, where he led multiple teams to deliver products leveraging hardware security.

Presentations

End-to-end security for Kafka, Spark ML, and Hadoop Session

Apache Kafka is used today by over 35% of Fortune 500 companies to store and process some of their most sensitive datasets. Ajit Gaddam and Jiphun Satapathy provide a security reference architecture to secure your Kafka cluster while leveraging it to support your organization's cybersecurity requirements.

Andrei Savu is a software engineer at Cloudera, where he’s working on Cloudera Director, a product that makes Hadoop deployments in cloud environments easy and more reliable for customers.

Presentations

A deep dive into leveraging cloud infrastructure for data engineering workloads Session

Cloud infrastructure, with a scalable data store and elastic compute, is particularly well suited for large-scale data engineering workloads. Andrei Savu and Jennifer Wu explore the latest cloud technologies and outline cost, security, and ease-of-use considerations for data engineers.

Deploying and operating big data analytic apps on the public cloud Tutorial

Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

William Schmarzo is the CTO of Dell EMC, where he is responsible for setting the strategy and defining the service line offerings and capabilities for the EMC Consulting Enterprise Information Management and Analytics service line. Bill has more than two decades of experience in data warehousing, BI, and analytics applications. He authored the Business Benefits Analysis methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements and has served on the Data Warehouse Institute’s faculty as the head of the analytic applications curriculum. Previously, Bill was the vice president of analytics at Yahoo, where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of actionable insights through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing, and sales of their industry-defining analytic applications. Bill is the author of Big Data: Understanding How Data Powers Big Business (forthcoming from Wiley), has written several white papers, and coauthored with Ralph Kimball a series of articles on analytic applications. He is a frequent speaker on the use of big data and advanced analytics to power an organization’s key business initiatives. Bill holds a master’s degree in business administration from the University of Iowa and a bachelor of science degree in mathematics, computer science, and business administration from Coe College. You can find out more on the EMC website.

Presentations

Determining the economic value of your data Tutorial

Organizations need a model to measure how effectively they are using data and analytics. Once they know where they are and where they need to go, they then need a framework to determine the economic value of their data. William Schmarzo explores techniques for getting business users to “think like a data scientist” so they can assist in identifying data that makes the best performance predictors.

Eric Schmidt is the product management lead for Cloud Dataflow on the Cloud engineering team at Google, where his primary role is to help shape the future of fully managed, large-scale data processing. Eric spends the majority of his time working with existing cloud customers and on-premises developers who are moving their MapReduce and related data processing workloads to the cloud. He led the announcement of Cloud Dataflow (at Google I/O’s 2014 keynote) with the development of a real-time sentiment analysis and results prediction framework for the 2014 World Cup. Eric has a deep passion for user interaction modeling, data modeling, and analytical processing of user behaviors. He also has development experience with .NET, C, JavaScript, Python, and Java and cloud expertise with Microsoft Azure and Amazon Web Services. Previously, Eric was a senior director within Technical Evangelism and Development at Microsoft, where he led a team focusing on the adoption of Microsoft’s devices and cloud services. His team delivered Emmy Award-winning applications for NBC Sports, the Democratic National Convention, NCAA March Madness on Demand, and Major League Soccer. Eric also spearheaded the development of a real-time quality and user behavior system for the online streaming of the 2008, 2010, and 2012 Olympic games.

Presentations

Analyzing every pitch from the 2016 MLB season: Insights, anomalies, and predictions Session

Some people say baseball is boring; others say data analysis is boring—but with the right data and tools and an open mind, both can be fun. Drawing on MLB data, Eric Schmidt explores data processing, analysis, and insight, using pitch-by-pitch data to answer questions like: Will a bunt happen during this at bat? Are some nationalities better fastball hitters? Will the next pitch be a strike?

Robert Schroll is the data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.
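
For a flavor of the API the training explores, here is a minimal TFLearn model sketch. It assumes the MNIST sample data bundled with TFLearn and is illustrative only, not the course material itself.

```python
# Minimal TFLearn example: a two-layer network on MNIST digits.
# Illustrative sketch only; hyperparameters are arbitrary.
import tflearn
import tflearn.datasets.mnist as mnist

X, Y, testX, testY = mnist.load_data(one_hot=True)

net = tflearn.input_data(shape=[None, 784])          # flattened 28x28 images
net = tflearn.fully_connected(net, 128, activation='relu')
net = tflearn.fully_connected(net, 10, activation='softmax')
net = tflearn.regression(net, optimizer='adam',
                         loss='categorical_crossentropy')

model = tflearn.DNN(net)
model.fit(X, Y, n_epoch=5, validation_set=(testX, testY))
print(model.evaluate(testX, testY))  # accuracy on the held-out set
```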

Machine learning with TensorFlow (Day 2) Training Day 2

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies, Inc. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for six years.

Presentations

Cloudy with a chance of on-prem Tutorial

The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company for your use cases is critical to ensuring success. Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations.

Jonathan Seidman is a solutions architect on the Partner Engineering team at Cloudera. Before joining Cloudera, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the Internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly Media.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

One cluster does not fit all: Architecture patterns for multicluster Apache Kafka deployments Session

There are many good reasons to run more than one Kafka cluster. And a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions to help you better choose the right architecture for your needs.

Stream me up, Scotty: Transitioning to the cloud using a streaming data platform Session

Gwen Shapira and Bob Lehmann share their experience and patterns building a cross-data-center streaming data platform for Monsanto. Learn how to facilitate your move to the cloud while "keeping the lights on" for legacy applications. In addition to integrating private and cloud data centers, you'll discover how to establish a solid foundation for a transition from batch to stream processing.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among others, and at KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Presentations

Unravelling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various machine-learning approaches available in the Spark framework (and beyond) for understanding and deciphering meaningful patterns in real-world data in order to derive value.

Jeff Shmain is a senior solution architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Unravelling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various machine-learning approaches available in the Spark framework (and beyond) for understanding and deciphering meaningful patterns in real-world data in order to derive value.

Vartika Singh is a solutions architect at Cloudera with over 10 years of experience applying machine-learning techniques to big data problems.

Presentations

Unravelling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various machine-learning approaches available in the Spark framework (and beyond) for understanding and deciphering meaningful patterns in real-world data in order to derive value.

Mehmet Irmak Sirer is a partner and data scientist at Datascope Analytics, where he has helped companies across industries solve problems with data, from small companies to members of the Fortune 50. Irmak has conducted and published academic research on a wide range of topics, including student choices in public schools, the web browsing behavior of the masses, global airline networks, species conservation in ecology, language topic models, and optimizing DNA sequences for high gene expression, among others. Technically, Irmak is a materials scientist (with both a BS and an MS from Sabanci University, Turkey), a chemical and biological engineer (with both an MS and a PhD from Northwestern University), and an art historian (his minor at Sabanci University). Practically, he believes in merging knowledge from different disciplines to ask and answer the right questions. When he is not striving for this, he believes in movies, bourbon, and Elliott Smith.

Presentations

Four key skills for vice presidents, directors, and managers in data-driven organizations Session

In a data-driven organization, vice presidents, directors, and managers play a crucial role as translators between senior leadership and data science teams. They don’t need to be full-fledged data scientists, but they do need data science “street smarts” in order to succeed in this critical task. Mehmet Irmak Sirer outlines the skills they need and gives practical ways to improve them.

Ram Shankar is a security data wrangler in Azure Security Data Science, where he works on the intersection of ML and security. Ram’s work at Microsoft includes a slew of patents in the large-scale intrusion detection space (called “fundamental and groundbreaking” by evaluators). In addition, he has given talks in internal conferences and received Microsoft’s Engineering Excellence award. Ram has previously spoken at data-analytics-focused conferences like Strata San Jose and the Practice of Machine Learning as well as at security-focused conferences like BlueHat, DerbyCon, FireEye Security Summit (MIRCon), and Infiltrate. Ram graduated from Carnegie Mellon University with master’s degrees in both ECE and innovation management.

Presentations

Operationalizing security data science for the cloud: Challenges, solutions, and trade-offs Session

Ram Shankar Siva Kumar and Andrew Wicker explain how to operationalize security analytics for production in the cloud, covering a framework for assessing the impact of compliance on model design, six strategies and their trade-offs to generate labeled attack data for model evaluation, key metrics for measuring security analytics efficacy, and tips to scale anomaly detection systems in the cloud.

Kaarthik Sivashanmugam is a principal software engineer on the Shared Data platform team at Microsoft. Kaarthik is the tech lead for the Mobius project specializing in Spark Streaming. Prior to joining the Shared Data platform team, he was on the Bing Ads team, where he built a near real-time analytics platform using Kafka, Storm, and Elasticsearch and used it to implement data processing pipelines. Previously, at Microsoft, Kaarthik was involved in the development of Data Quality Services in Azure and also contributed to multiple releases of SQL Server Integration Services as a hands-on engineering manager. Before joining Microsoft, Kaarthik was a senior software engineer in a semantic technology startup, where he built an ontology-based semantic metadata platform and used it to implement solutions for KYC/AML analytics.

Presentations

Spark at scale in Bing: Use cases and lessons learned Session

Spark powers various services in Bing, but the Bing team had to customize and extend Spark to cover its use cases and scale the implementation of Spark-based data pipelines to handle internet-scale data volume. Kaarthik Sivashanmugam explores these use cases, covering the architecture of Spark-based data platforms, challenges faced, and the customization done to Spark to address the challenges.

Crystal Skelton is an associate in Kelley Drye & Warren’s Los Angeles office, where she represents a wide array of clients from tech startups to established companies in privacy and data security, advertising and marketing, and consumer protection matters. Crystal advises clients on privacy, data security, and other consumer protection matters, specifically focusing on issues involving children’s privacy, mobile apps, data breach notification, and other emerging technologies and counsels clients on conducting practices in compliance with the FTC Act, the Children’s Online Privacy Protection Act (COPPA), the Gramm-Leach-Bliley Act, the GLB Safeguards Rule, Fair Credit Reporting Act (FCRA), the Fair and Accurate Credit Transactions Act (FACTA), and state privacy and information security laws. She regularly drafts privacy policies and terms of use for websites, mobile applications, and other connected devices.

Crystal also helps advertisers and manufacturers balance legal risks and business objectives to minimize the potential for regulator, competitor, or consumer challenge while still executing a successful campaign. Her advertising and marketing experience includes counseling clients on issues involved in environmental marketing, marketing to children, online behavioral advertising (OBA), commercial email messages, endorsements and testimonials, food marketing, and alcoholic beverage advertising. She represents clients in advertising substantiation proceedings and other matters before the Federal Trade Commission (FTC), the US Food and Drug Administration (FDA), and the Alcohol and Tobacco Tax and Trade Bureau (TTB) as well as in advertiser or competitor challenges before the National Advertising Division (NAD) of the Council of Better Business Bureaus. In addition, she assists clients in complying with accessibility standards and regulations implementing the Americans with Disabilities Act (ADA), including counseling companies on website accessibility and advertising and technical compliance issues for commercial and residential products. Prior to joining Kelley Drye, Crystal practiced privacy, advertising, and transactional law at a highly regarded firm in Washington, DC, and as a law clerk at a well-respected complex commercial and environmental litigation law firm in Los Angeles, CA. Previously, she worked at the law firm featured in the movie Erin Brockovich, where she worked directly with Erin Brockovich and the firm’s name partner to review potential new cases.

Presentations

Doing data right: Legal best practices for making your data work Session

Big data promises enormous benefits for companies, and new innovations in this space only mean more data collection is required. Having a solid understanding of legal obligations will help you avoid the legal snafus that can come with collecting big data. Alysa Hutnik and Crystal Skelton outline legal best practices and practical tips to avoid becoming a big data “don’t.”

Emily Spahn is a data scientist at ProKarma. Emily enjoys leveraging data to solve a wide range of problems. Previously, she worked as a civil engineer with a focus on hydraulic and hydrologic modeling and has worked in government, private industry, and academia over her career. Emily holds degrees in physics and environmental engineering.

Presentations

Saving lives with data: Identifying patients at risk of decline Session

Many hospitals combine early warning systems with rapid response teams (RRT) to detect patient decline and respond with elevated care. Predictive models can minimize RRT events by identifying at-risk patients, but modeling is difficult because events are rare and features are varied. Emily Spahn explores the creation of one such patient-risk model and shares lessons learned along the way.

Raghotham Sripadraj is cofounder and data scientist at Unnati Data Labs, where he is building end-to-end data science systems in the fields of fintech, marketing analytics, and event management. Raghotham is also a mentor for data science on Springboard. Previously, at Touchpoints Inc., he single-handedly built a data analytics platform for a fitness wearable company; at Redmart, he worked on the CRM system and built a sentiment analyzer for Redmart’s social media; and at SAP Labs, he was a core part of what is currently SAP’s framework for building web and mobile products, as well as a part of multiple company-wide events helping to spread knowledge both internally and to customers. Drawing on his deep love for data science and neural networks and his passion for teaching, Raghotham has conducted workshops across the world and given talks at a number of data science conferences. Apart from getting his hands dirty with data, he loves traveling, Pink Floyd, and masala dosas.

Presentations

Making architecture choices for small and big data problems Session

Not all data science problems are big data problems. Lots of small and medium-sized product companies want to start their journey to become data driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed, fault-tolerant tools.

Abdul Subhan is the principal solutions architect for Verizon’s 4G Data Analytics team, where he is responsible for developing, deploying, and managing a broad range of mission-critical reporting and analytics applications. Previously, Abdul spent close to a decade as an internal technology consultant to various parts of Verizon’s wireless, wireline, broadband, data center, and video businesses and managed data and technology at Nuance Communications and Alcatel-Lucent. Abdul holds a master’s degree in computer engineering from King Fahd University of Petroleum & Minerals and an undergraduate degree in telecommunications from Visvesvaraya Technological University.

Presentations

From hours to milliseconds: How Verizon accelerated its mobile analytics Session

With more than 91M customers, Verizon produces oceans of data. The challenge this onslaught presents isn’t one of storage—it’s one of speed. The solution? Harnessing the power of GPUs to access insights in less than a millisecond. Todd Mostak and Abdul Subhan explain how Verizon solved its data challenge by implementing GPU-tuned analytics and visualization.

Sean Suchter is the CTO and cofounder of Pepperdata. Previously, Sean was the founding GM of Microsoft’s Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search. Before that, he managed the Yahoo Search Technology team, the first production user of Hadoop; he joined Yahoo through the acquisition of Inktomi. Sean holds a BS in engineering and applied science from Caltech.

Presentations

Big data for big data: Machine-learning models of Hadoop cluster behavior Session

Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.

Brian Suda is a master informatician currently residing in Reykjavík, Iceland. Since first logging on in the mid-’90s, he has spent a good portion of each day connected to the internet. When he is not hacking on microformats or writing about web technologies, he enjoys taking kite aerial photography. His own little patch of internet can be found at Suda.co.uk, where many of his past projects, publications, interviews, and crazy ideas can be found.

Presentations

Introduction to visualizations using D3 Tutorial

Visualizations are a key part of conveying any dataset. D3 is the most popular, easiest, and most extensible way to get your data online in an interactive way. Brian Suda outlines best practices for good data visualizations and explains how you can build them using D3.

Arvind Surve is a software developer at IBM’s Spark Technology Center in San Francisco. Arvind is a SystemML contributor and committer. He has worked for IBM for 17+ years. Arvind has presented at the 2015 Data Engineering Conference in Tokyo and for the Chicago Spark User group. He holds an MS in digital electronics and communication systems and an MBA in finance and marketing.

Presentations

Compressed linear algebra in Apache SystemML Session

Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in the main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.

Mahesh Goud is a data scientist on Ticketmaster’s Data Science team, where he focuses on the automation and optimization of its paid customer acquisition systems and works on quantifying and modeling various data sources used by the Paid Acquisition Optimization Engine to increase the efficacy of marketing spend. He has helped develop and test the platform since its conception. Previously, he was a software engineer at Citigroup, where he was involved in the development of a real-time stock pricing engine. Mahesh holds a master’s degree in computer science specializing in data science from the University of Southern California and a bachelor’s degree with honors in computer science specializing in computer vision from the International Institute of Information Technology, Hyderabad.

Presentations

A contextual real-time bidding engine for search engine marketing Session

Mahesh Goud shares success stories using Ticketmaster's large-scale contextual bandit platform for SEM, which determines the optimal keyword bids under evolving keyword contexts to meet different business requirements, and covers Ticketmaster's streaming pipeline, consisting of Storm, Kafka, HBase, the ELK Stack, and Spring Boot.

Shubham Tagra is a member of the technical staff at Qubole working on Presto and Hive development and making these solutions cloud ready. Previously, Shubham worked at NetApp on its storage area network. Shubham holds a bachelor’s degree in computer engineering from the National Institute of Technology, Karnataka, India.

Presentations

RubiX: A caching framework for big data engines in the cloud Session

Shubham Tagra offers an introduction to RubiX, a lightweight, cross-engine caching solution that works well with optimized columnar formats by caching only the required amount of data. RubiX can be used with any data analytics engine that reads data from remote sources via the Hadoop FileSystem interface without any changes to the source code of those engines.

David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science along with master’s degrees in both computer science and business administration.

Presentations

Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies Session

David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.
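
One small building block of such a system, a dictionary-based annotator that tags terms from a curated taxonomy, can be sketched with spaCy's PhraseMatcher (assuming spaCy 3.x); the three clinical terms below are invented placeholders, not the session's actual taxonomy.

```python
# A dictionary-based annotator: tag mentions of terms from a curated
# taxonomy using spaCy's PhraseMatcher (spaCy 3.x API assumed).
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

taxonomy = ["chest pain", "shortness of breath", "myocardial infarction"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("CLINICAL_TERM", [nlp.make_doc(term) for term in taxonomy])

doc = nlp("Patient reports chest pain and shortness of breath.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "chest pain", "shortness of breath"
```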

Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today Daniel works on the YARN development team at Cloudera, focused on the resource manager, fair scheduler, and Docker support.

Presentations

Docker on YARN Session

Docker makes it easy to bundle an application with its dependencies and provide full isolation, and YARN now supports Docker as an execution engine for submitted applications. Daniel Templeton explains how YARN's Docker support works, why you'd want to use it, and when you shouldn't.

Jasjeet Thind is the senior director of data science and engineering at Zillow. His group focuses on machine-learned prediction models and big data systems that power use cases such as Zestimates, personalization, housing indices, search, content recommendations, and user segmentation. Prior to Zillow, Jasjeet served as director of engineering at Yahoo, where he architected a machine-learned real-time big data platform leveraging social signals for user interest and content prediction. The system powers personalized content on Yahoo, Yahoo Sports, and Yahoo News. Jasjeet holds a BS and master’s degree in computer science from Cornell University.

Presentations

Zillow: Transforming real estate through big data and data science Session

Zillow pioneered providing access to unprecedented information about the housing market. Long gone are the days when you needed an agent to get comparables and prior sale and listing data. And with more data, data science has enabled more use cases. Jasjeet Thind explains how Zillow uses Spark and machine learning to transform real estate.

Rocky Tiwari is the manager of innovation and architecture at Transamerica.

Presentations

Transamerica's journey to Customer 360 and beyond Session

Vishal Bamba and Rocky Tiwari offer an overview of Transamerica's Customer 360 platform and the work done afterward to utilize this technology, including graph databases and machine learning, to help create targeted segments for products and campaigns.

Carlo Torniai is head of data science and analytics at Pirelli. Previously, he was a staff data scientist at Tesla Motors. He received his PhD in informatics from the Università degli Studi di Firenze, Italy.

Presentations

How a global manufacturing company built a data science capability from scratch Tutorial

Building a cross-functional data science team at a large, multinational manufacturing company presents a number of cultural, organizational, technical, and operational challenges. Carlo Torniai explains how Pirelli grew an organization that was able to deliver key insights in less than a year and shares advice for both new and established data science teams.

Teresa Tung is a technology fellow at Accenture Technology Labs, where she is responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s industry platforms like Accenture Digital Connected Products and Accenture Analytics Insights Platform. Teresa holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

DevOps for models: How to manage millions of models in production Session

As Accenture scaled from thousands to millions of predictive models, it needed automation to manage models at scale, ensure accuracy, prevent false alarms, and preserve trust as models are created, tested, and deployed into production. Teresa Tung, Jürgen Weichenberger, and Chris Kang discuss embracing DevOps for models by employing a self-healing approach to model life-cycle management.

Alexander Ulanov is a senior researcher at Hewlett Packard Labs, where he focuses his research on machine learning on a large scale. Currently, Alexander works on deep learning and graphical models. He has made several contributions to Apache Spark; in particular, he implemented the multilayer perceptron classifier. Previously, he worked on text mining, classification and recommender systems, and their real-world applications. Alexander holds a PhD in mathematical modeling from the Russian Academy of Sciences.

Presentations

Malicious site detection with large-scale belief propagation Session

Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP) for Apache Spark, applying BP to large web-crawl data to infer the probability that websites are malicious. Applications of BP include fraud detection, malware detection, computer vision, and customer retention.
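
To make the technique concrete, the toy sketch below runs loopy BP on a three-node cyclic graph in plain Python, with node priors standing in for per-site evidence and an edge potential that nudges linked sites toward the same label. It illustrates the message-passing idea only, not the speakers' Spark implementation, and the graph and potentials are invented.

```python
# Toy loopy belief propagation (sum-product) on a three-node cyclic graph.
# States: 0 = benign, 1 = malicious. Requires Python 3.8+ for math.prod.
import math
from itertools import permutations

prior = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.2, 0.8]}  # node evidence
edges = [("a", "b"), ("b", "c"), ("c", "a")]
psi = [[0.7, 0.3],  # psi[x_i][x_j]: linked sites tend to share a label
       [0.3, 0.7]]

nbrs = {n: sorted({v for e in edges if n in e for v in e} - {n}) for n in prior}
msg = {(i, j): [1.0, 1.0] for i, j in permutations(prior, 2) if j in nbrs[i]}

for _ in range(20):  # iterate message updates until they settle
    new = {}
    for (i, j) in msg:
        m = [sum(prior[i][xi] * psi[xi][xj] *
                 math.prod(msg[(k, i)][xi] for k in nbrs[i] if k != j)
                 for xi in (0, 1))
             for xj in (0, 1)]
        z = sum(m)
        new[(i, j)] = [v / z for v in m]  # normalize for numerical stability
    msg = new

for n in prior:  # belief = prior times the product of incoming messages
    b = [prior[n][x] * math.prod(msg[(k, n)][x] for k in nbrs[n]) for x in (0, 1)]
    print(n, "P(malicious) = %.3f" % (b[1] / sum(b)))
```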

Crystal Valentine is the vice president of technology strategy at MapR Technologies. Previously, Crystal was a consultant at Ab Initio working on big data problems for Fortune 100 clients and a tenure-track professor in the Department of Computer Science at Amherst College. Crystal was a Fulbright Scholar and holds a PhD in computer science from Brown University as well as a bachelor’s degree from Amherst College.

Presentations

Connected vehicle: A global cloud processing application Tutorial

Connected vehicle applications require massive amounts of data to be transported and analyzed—often in real time—in order to support high-frequency decision-making requirements. Crystal Valentine outlines the five foundational characteristics of a next-generation big data platform that can support innovative connected vehicle and mobility services applications.

Vinithra Varadharajan is an engineering manager in the Cloud organization at Cloudera, responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Vinithra was previously a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop life-cycle management.

Presentations

Deploying and operating big data analytic apps on the public cloud Tutorial

Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Naghman Waheed leads the Data Platforms team at Monsanto and is responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order-to-cash, finance, and procurement. Throughout his 20+ year career at Monsanto, Naghman has held a variety of positions in the data space ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master’s degree in information management, both from Washington University.

Presentations

The enterprise geospatial platform: A perfect fusion of cloud and open source technologies Session

The need to process geospatial data has grown in leaps and bounds at Monsanto. Recently, the volume of data collected from farmers' fields via sensors, rovers, drones, in-cabin technologies, and other sources has forced Monsanto to rethink its geospatial processing capabilities. Naghman Waheed explains how Monsanto built a scalable geospatial platform using cloud and open source technologies.

Marcus Walser is an engineering manager at Tapjoy, where he previously served as a staff operations engineer. Marcus has extensive experience working on Tapjoy’s mobile advertising technologies and was responsible for implementing new real-time architectures and advertising engines.

Presentations

Building a real-time analytics engine with Kafka and Spark for mobile advertising Tutorial

With more than 10,000 active applications, 4.23 million daily ad conversions, and over a billion total app downloads, Tapjoy knows mobile advertising. To ensure that users have the best application experiences, Tapjoy has architected a real-time advertising engine. Marcus Walser shares the critical considerations for building such a streaming architecture.

Dean Wampler is the architect for fast data products at Lightbend, where he specializes in scalable, distributed big data and streaming systems using tools like Spark, Mesos, Akka, Cassandra, and Kafka (the SMACK stack). Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly Media. He is a contributor to several open source projects and the co-organizer of several conferences around the world and several user groups in Chicago. Dean can be found on Twitter as @deanwampler.

Presentations

Just enough Scala for Spark Tutorial

Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.

Rachel Warren is a programmer, data analyst, adventurer, and aspiring data scientist. After spending a semester helping teach algorithms and software engineering in Africa, Rachel has returned to the Bay Area, where she is looking for work as a data scientist or programmer. Previously, Rachel worked as an analyst for both Pandora and the Political Science department at Wesleyan. She is currently interested in pursuing a more technical, algorithmic approach to data science and is particularly passionate about dynamic learning algorithms (ML) and text analysis. Rachel holds a BA in computer science from Wesleyan University, where she completed two senior projects: an application which uses machine learning and text analysis for the Computer Science department and a critical essay exploring the implications of machine learning on the analytic philosophy of language for the Philosophy department.

Presentations

Debugging Apache Spark Session

Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau and Rachel Warren explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
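
A two-line illustration of why laziness bites: in the hypothetical PySpark snippet below, the bad record passes silently through the transformation and only fails once an action forces execution, so the stack trace points at collect() rather than at the map that introduced the problem.

```python
# Why lazy evaluation complicates debugging: the bad record below sails
# through the transformation and only fails when an action runs the job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("debug-demo").getOrCreate()
spark.sparkContext.setLogLevel("WARN")  # trim log noise while investigating

rdd = spark.sparkContext.parallelize(["1", "2", "oops", "4"])
ints = rdd.map(int)  # no error yet: transformations are lazy

try:
    print(ints.collect())  # the ValueError surfaces here, at the action
except Exception as err:   # arrives wrapped in a Spark task failure
    print("failed at the action, not the transformation:", type(err).__name__)
```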

Melanie Warrick is a deep learning engineer at Skymind with a passion for AI and for working on machine-learning problems at scale. Melanie’s previous experience includes work with data science and engineering at Change.org and a comprehensive consulting career.

Presentations

Scalable deep learning for the enterprise with DL4J Tutorial

Dave Kale, Melanie Warrick, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets.

What is AI? Tutorial

Melanie Warrick explores the definition of artificial intelligence and seeks to clarify what AI will mean for our world. Melanie summarizes AI’s most important effects to date and demystifies the changes we’ll see in the immediate future, separating myth from realistic expectation.

Randy Wei is a software engineer at Uber. He holds a bachelor’s degree in computer science from the University of California, Berkeley.

Presentations

Uber's data science workbench Session

Peng Du and Randy Wei offer an overview of Uber’s data science workbench, which provides a central platform for data scientists to perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is seamlessly integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring.

Jürgen Weichenberger is a data science senior principal at Accenture Analytics, where he is currently working in the resources industries, with interests in smart grids and power, digital plant engineering, and optimization for upstream industries and the water industry. Jürgen has over 15 years of experience in engineering consulting, data science, big data, and digital change. In his spare time, he enjoys spending time with his family and playing golf and tennis. Jürgen holds a master’s degree (with first-class honors) in applied computer science and bioinformatics from the University of Salzburg.

Presentations

DevOps for models: How to manage millions of models in production Session

As Accenture scaled from thousands to millions of predictive models, it needed automation to manage models at scale, ensure accuracy, prevent false alarms, and preserve trust as models are created, tested, and deployed into production. Teresa Tung, Jürgen Weichenberger, and Chris Kang discuss embracing DevOps for models by employing a self-healing approach to model life-cycle management.

Jay White Bear is a data scientist and advisory software engineer at IBM. Jay holds a degree in computer science from the University of Michigan, where her work focused on databases, machine learning, computational biology, and cryptography. Jay has also worked on multiobjective optimization, computational biology, and bioinformatics at the University of California, San Francisco, and on machine learning, multiobjective optimization for path planning, and cryptography at McGill University.

Presentations

The IoT and the autonomous vehicle in the clouds: Simultaneous localization and mapping (SLAM) with Kafka and Spark Streaming Tutorial

The simultaneous localization and mapping (SLAM) problem is the cutting edge of robotics for autonomous vehicles and a key challenge in both industry and research. Jay White Bear shares a new integrated framework that demonstrates a constrained SLAM using online algorithms to navigate and map in real time using the Turtlebot II.

Andrew Wicker is a machine learning engineer in the Security division at Microsoft, where his current work focuses on researching and developing machine-learning solutions to protect identities in the cloud. Andrew’s previous work includes developing machine-learning models to detect safety events in an immense amount of FAA radar data and working on the development of a distributed graph analytics system. His expertise encompasses the areas of artificial intelligence, graph analysis, and large-scale machine learning. Andrew holds a BS, an MS, and a PhD in computer science from North Carolina State University.

Presentations

Operationalizing security data science for the cloud: Challenges, solutions, and trade-offs Session

Ram Shankar Siva Kumar and Andrew Wicker explain how to operationalize security analytics for production in the cloud, covering a framework for assessing the impact of compliance on model design, six strategies and their trade-offs to generate labeled attack data for model evaluation, key metrics for measuring security analytics efficacy, and tips to scale anomaly detection systems in the cloud.

Edd Wilder-James is a technology analyst, writer, and entrepreneur based in California. He’s helping transform businesses with data as VP of strategy for Silicon Valley Data Science. Formerly Edd Dumbill, Edd was the founding program chair for the O’Reilly Strata conferences and chaired the Open Source Convention for six years. He was also the founding editor of the peer-reviewed journal Big Data. A startup veteran, Edd was the founder and creator of the Expectnation conference-management system and a cofounder of the Pharmalicensing.com online intellectual-property exchange. An advocate and contributor to open source software, Edd has contributed to various projects such as Debian and GNOME and created the DOAP vocabulary for describing software projects. Edd has written four books including O’Reilly’s Learning Rails.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.

The business case for deep learning, Spark, and friends Tutorial

Deep learning is white-hot at the moment, but why does it matter? Developers are usually the first to understand why some technologies cause more excitement than others. Edd Wilder-James relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Ian Wrigley has taught tens of thousands of students over the last 25 years in subjects ranging from C programming to Hadoop development and administration. Ian is currently the director of education services at Confluent, where he heads the team building and delivering courses focused on Apache Kafka and its ecosystem.

Presentations

Building real-time data pipelines with Apache Kafka Tutorial

Ian Wrigley demonstrates how Kafka Connect and Kafka Streams can be used together to build real-world, real-time streaming data pipelines. Using Kafka Connect, you'll ingest data from a relational database into Kafka topics as the data is being generated and then process and enrich the data in real time using Kafka Streams before writing it out for further analysis.
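For a sense of the Streams half of such a pipeline, here is a minimal Java sketch that consumes a topic fed by a Kafka Connect source, applies a placeholder enrichment step, and writes the result to an output topic. The topic names ("db-orders" and "orders-enriched") and the trivial uppercase transformation are illustrative assumptions, not part of the tutorial material.

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class OrderEnricher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enricher");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // "db-orders" stands in for a topic populated by a Kafka Connect JDBC source.
            KStream<String, String> orders = builder.stream("db-orders");

            // Placeholder for real enrichment (joins against reference data, filtering, etc.).
            orders.mapValues(value -> value.toUpperCase())
                  .to("orders-enriched");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

In a production pipeline, the enrichment step would typically join incoming records against compacted reference topics or state stores rather than transform each value in isolation.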

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Before joining Cloudera, Jennifer worked as a product line manager at VMware, working on the vSphere and Photon system management platforms.

Presentations

A deep dive into leveraging cloud infrastructure for data engineering workloads Session

Cloud infrastructure, with a scalable data store and elastic compute, is particularly well suited for large-scale data engineering workloads. Andrei Savu and Jennifer Wu explore the latest cloud technologies and outline cost, security, and ease-of-use considerations for data engineers.

Deploying and operating big data analytic apps on the public cloud Tutorial

Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Yinglian Xie is the CEO and cofounder of DataVisor, a startup in the area of big data analytics for security. Yinglian has been working in internet security and privacy for over 10 years and has helped improve the security of billions of online users. Her work combines parallel-computing techniques, algorithms for mining large datasets, and security-domain knowledge into new solutions that prevent and combat a wide variety of attacks targeting consumer-facing online services. Prior to DataVisor, Yinglian was a senior researcher at Microsoft Research Silicon Valley, where she shipped a series of new techniques into production. She has been widely published in top conferences and has served on the committees of many of them. Yinglian holds a PhD in computer science from Carnegie Mellon University.

Presentations

Don’t sleep on sleeper cells: Using big data to drive detection Session

How many of your users are really fraudsters waiting to strike? These sleeper cells exist in all online communities. Using data from more than 400M users and 500B events from online services across the world, Yinglian Xie explores sleeper cells, explains sophisticated attack techniques being used to evade detection, and shows how Spark's in-memory big data security analytics can help.

Reynold Xin is a cofounder and chief architect at Databricks as well as an Apache Spark PMC member and release manager for Spark’s 2.0 release. Prior to Databricks, Reynold was pursuing a PhD at the UC Berkeley AMPLab, where he worked on large-scale data processing.

Presentations

A behind-the-scenes look into Spark's API and engine evolutions Session

Reynold Xin looks back at the history of data systems, from filesystems, databases, and big data systems (e.g., MapReduce) to "small data" systems (e.g., R and Python), covering the pros and cons of each, the abstractions they provide, and the engines underneath. Reynold then shares lessons learned from this evolution, explains how Spark is developed, and offers a peek into the future of Spark.

Tony Xing is a senior product manager on the Shared Data team within Microsoft’s Application and Service group. Previously, he was a senior product manager on the Skype data team in the same group. Tony recently gave a talk at Strata + Hadoop World in Beijing.

Presentations

The common anomaly detection platform at Microsoft Session

Tony Xing offers an overview of Microsoft's common anomaly detection platform, an API service built internally to provide product teams the flexibility to plug in any anomaly detection algorithms to fit their own signal types.

David Yan is an Apache Apex PMC member and an architect at DataTorrent. Previously, David worked in the Ad Systems, Yahoo Finance, and del.icio.us groups at Yahoo and in the Artificial Intelligence group at the Jet Propulsion Laboratory. David holds an MS in computer science from Stanford University and a BS in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Developing streaming applications with Apache Apex Session

David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics. With Apex, you can build applications that process data scalably and reliably, with high throughput and low latency.

An expert in quantitative modeling with a strong background in finance, Jeffrey Yau has over 17 years of experience applying econometric, statistical, and mathematical modeling techniques to real-world challenges. As vice president of data science at Silicon Valley Data Science, Jeffrey has a passion for leading data science teams in finding innovative solutions to challenging business problems.

Presentations

Graph-based anomaly detection: When and how Session

Thanks to frameworks such as Spark's GraphX and GraphFrames, graph-based techniques are increasingly applicable to anomaly, outlier, and event detection in time series. Jeffrey Yau offers an overview of applying graph-based techniques to fraud detection, IoT processing, and financial data and outlines the benefits of graphs relative to other techniques.

Ting-Fang Yen is a research scientist at DataVisor, a startup providing big data security analytics for consumer-facing web and mobile sites. Ting-Fang holds a PhD in electrical and computer engineering from Carnegie Mellon University.

Presentations

Cloudy with a chance of fraud: A look at cloud-hosted attack trends Session

When it comes to visibility into account takeover, spam, and fake accounts, the cloud is making things hazy. Cloud-hosted attacks skirt IP blacklists and make fraudulent users seem like they are located somewhere they are not. Drawing on data from 500 billion events and 400 million user accounts, Ting-Fang Yen examines cloud-based attack trends across verticals and regions.

Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.

Presentations

Making self-service data science a reality Session

Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.

Hang Zhang is a senior data science manager on the Algorithm and Data Science team in the Data Group at Microsoft, where his major focus is on team data science processes and the Cortana Intelligence Competition Platform. Previously, Hang was a staff data scientist at WalmartLabs in charge of internal business intelligence tools and a senior data scientist at Opera Solutions. He is a senior member of the IEEE. Hang holds a PhD in industrial and systems engineering and an MS in statistics from Rutgers University.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Mengyue Zhao is a data scientist at Microsoft, where she develops end-to-end machine-learning solutions for various use cases in cloud computing and distributed platforms (e.g., Azure, Hadoop, and Spark). Mengyue focuses on scalable analysis, including data processing, feature engineering, feature selection, predictive modeling, and web services development. Previously, she was a data analyst at GE Digital, mainly focusing on solving machine-learning problems in the manufacturing domain. Mengyue has broad interests in machine learning, deep learning, and data mining and is passionate about harnessing the power of big data to answer interesting questions and drive business decisions. Mengyue holds a master’s degree in analytics from the University of San Francisco.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Alice Zheng manages the optimization team on Amazon’s Ad Platform. Alice specializes in research and development of machine-learning methods, tools, and applications. Outside of work, she is writing a book, Mastering Feature Engineering. Previously, Alice worked at GraphLab/Dato/Turi, where she led the machine-learning toolkits team and spearheaded user outreach, was a researcher in the Machine Learning group at Microsoft Research, Redmond, and was a postdoc at Carnegie Mellon University. Alice holds PhD and BA degrees in computer science and a BA in mathematics, all from UC Berkeley.

Presentations

Feature engineering for diverse data types Session

In the machine-learning pipeline, feature engineering takes up the majority of the time yet is seldom discussed. Alice Zheng leads a tour of popular feature engineering methods for text, logs, and images, giving you an intuitive and actionable understanding of the tricks of the trade.

As vice president of products at Trifacta, Wei Zheng combines her passion for technology with experience in enterprise software to define and shape Trifacta’s product offerings. Having founded several startups of her own, Wei believes strongly in innovative technology that solves real-world business problems. Most recently, she led product management efforts at Informatica, where she helped launch several new solutions including its Hadoop and data-virtualization products.

Presentations

Why the next wave of data lineage is driven by automation, visualization, and interaction Session

Sean Kandel and Wei Zheng offer an overview of an entirely new approach to visualizing metadata and data lineage, demonstrating automated methods for detecting, visualizing, and interacting with potential anomalies in reporting pipelines. Join in to learn what’s required to efficiently apply these techniques to large-scale data.

Chao Zhong is a senior data scientist at C+E Analytics and Insights within Microsoft. His current research interests include (deep) machine learning for customer journey and customer lifetime value and (deep) reinforcement learning for interactive customer behavior modeling. Previously, Chao was the lead data scientist at Scopely, a mobile gaming company in Los Angeles. Chao was an ABD (all but dissertation) PhD candidate in mathematics at Michigan Technological University. He holds an MS degree in financial engineering from Temple University and a BS degree in computer science from Beijing University of Aeronautics and Astronautics.

Presentations

Predicting customer lifetime value for a subscription-based business Session

Chao Zhong offers an overview of a new predictive model for customer lifetime value (LTV) in a cloud computing business. This model is also the first known application of the Fader RFM approach to a cloud business—a Bayesian approach that predicts a customer's LTV with a symmetric absolute percentage error (SAPE) of only 3% on an out-of-time testing dataset.
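For readers unfamiliar with the metric, one common per-prediction formulation of the symmetric absolute percentage error, for an actual value y and a prediction ŷ, is the following (the talk may aggregate or scale it differently):

    \mathrm{SAPE} = \frac{|y - \hat{y}|}{\left(|y| + |\hat{y}|\right) / 2}

Under this formulation, a SAPE of 3% means the prediction and the actual value differ by about 3% of their average magnitude.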

Feng Zhu is a data scientist at C+E Analytics and Insights at Microsoft, where he focuses on building end-to-end solutions for various problems in Microsoft's cloud business using advanced machine-learning techniques. Previously, Feng was a research scientist on the Fraud Detection and Risk Management team at Amazon, where he collaborated with various business and engineering teams to provide fraud detection and mitigation solutions for the Pay with Amazon product.

Feng holds a PhD in electrical engineering and two MS degrees, in electrical engineering and applied mathematics, from the University of Notre Dame, as well as a BS from the Harbin Institute of Technology in China.

Presentations

How Microsoft predicts churn of cloud customers using deep learning and explains those predictions in an interpretable way Session

Although deep learning has proved to be very powerful, few results are reported on its application to business-focused problems. Feng Zhu and Val Fontama explore how Microsoft built a deep learning-based churn predictive model and demonstrate how to explain the predictions using LIME—a novel algorithm published in KDD 2016—to make the black box models more transparent and accessible.
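As background on the method mentioned above: in the KDD 2016 paper, LIME explains a single prediction by fitting a simple surrogate model that is faithful to the black-box model in the neighborhood of the instance being explained, selecting the explanation via

    \xi(x) = \operatorname*{argmin}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)

where f is the black-box model, G is a family of interpretable models (e.g., sparse linear models), π_x weights perturbed samples by their proximity to the instance x, 𝓛 measures how poorly g approximates f under that weighting, and Ω penalizes the complexity of g.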