Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.

Mike Abbott is a general partner at Kleiner Perkins Caufield & Byers, where he focuses on investments in the firm’s digital practice, helping entrepreneurs in the social, mobile, and cloud computing sectors rapidly scale teams and ventures. Mike serves as an expert resource on enterprise infrastructure, cloud computing, and big data. He also helps entrepreneurs win the race for talent in a hypercompetitive recruitment environment. Mike is an engineering leader, entrepreneur, and investor and an expert in big data businesses. Formerly the vice president of engineering at Twitter, Mike led the team to rebuild and solidify Twitter’s infrastructure, growing the engineering team from 80 to more than 350 engineers in less than a year and a half and scaling Twitter’s architecture to support hundreds of millions of daily tweets. Before joining Twitter, Mike led the software development team at Palm that created HP/Palm’s next-generation webOS platform. Earlier in his career, Mike was the general manager at Microsoft for .NET online services, which became Azure. He founded Composite Software (acquired by Cisco) and was a cofounder of Passenger Inc. Mike has advised and invested in numerous software companies throughout his career, including Cloudera, Hearsay Labs, Jawbone, Saynow, Reverb Technologies, Toytalk, and Locu. Mike holds a bachelor’s degree from California Polytechnic State University and has completed coursework toward a PhD at the University of Washington.

Presentations

Unboxing logistics innovation Tutorial

Michael Abbott shares trends Kleiner Perkins Caufield & Byers is seeing in the area of transportation and logistics from an investments perspective and offers direct insights from companies in the sector, looking at how these firms deal with unique data processing challenges.

Vivek Agate is a staff software engineer at FireEye with more than eight years of experience in software design and development across a range of Java technologies.

Presentations

FireEye's journey migrating 25 TB of RDBMS data to Hadoop Session

Ganesh Prabhu, Alex Rivlin, and Vivek Agate share an approach that enabled a small team at FireEye to migrate 20 TB of RDBMS data, comprising 250+ tables and nearly 2,000 partitions, to Hadoop, as well as an adaptive platform that allows migration of a rapidly changing dataset to Hive. Along the way, they explore some of the challenges typical for a company implementing Hadoop.

John Mark Agosta is a principal data scientist in IMML at Microsoft. Over his career, he has worked with startups and labs in the Bay Area, including the original Knowledge Industries, and was a researcher at Intel Labs, where he was awarded a Santa Fe Institute Business Fellowship in 2007, and at SRI International after receiving his PhD from Stanford. He has participated in the annual Uncertainty in AI conference since its inception in 1985, proving his dedication to probability and its applications. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Ashvin Agrawal is a senior research engineer at Microsoft, where he works on streaming systems and contributes to the Twitter Heron project. Ashvin is a software engineer with more than 10 years of experience who specializes in developing large-scale distributed systems. Previously, he worked at VMware, Yahoo, and Mojo Networks. Ashvin holds an MTech in computer science from IIT Kanpur, India.

Presentations

From rivulets to rivers: Elastic stream processing in Heron Session

Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.

Shekhar Agrawal is the director of data science at Comcast. Shekhar is an expert data scientist with specialization in the text and NLP fields. He currently handles several PB-scale modeling initiatives to improve customer experience factors.

Presentations

Real-time analytics using Kudu at petabyte scale Session

Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.

Parvez Ahammad leads the data science and machine learning efforts at Instart Logic. His group is focused on creating data-driven algorithms and innovative product features that optimize and secure web application delivery at scale. He has applied machine learning in a variety of domains, most recently to computational neuroscience, web application delivery, and web application security. Along the way, he has mentored data scientists, built teams, and grappled with issues like the explainability and interpretability of ML systems, insufficient labeled data, scalability, ethics, and adversaries who target ML models. Parvez holds a PhD in electrical engineering and computer sciences from UC Berkeley, with an emphasis in computer vision and machine learning.

Presentations

Applying machine learning in security: Past, present, and future Session

Recently, research on applying and designing ML algorithms and systems for security has grown quickly as information and communications have become more ubiquitous and more data has become available. Parvez Ahammad explores generalized system designs, underlying assumptions, and use cases for applying ML in security.

Manish Ahluwalia is a software engineer at Cloudera, where he focuses on security of the Hadoop ecosystem. Manish has been working in big data since its infancy in various companies in Silicon Valley. He is most passionate about security.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Mark Donsky, André Araujo, Michael Yoder, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Tyler Akidau is a senior staff software engineer at Google Seattle. He leads technical infrastructure’s internal data processing teams in Seattle (MillWheel and Flume), is a founding member of the Apache Beam PMC, and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” articles. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Ask me anything: Apache Beam AMA

Join Tyler Akidau, Frances Perry, Kenneth Knowles, and Slava Chernyak to discuss anything related to Apache Beam.

Learn stream processing with Apache Beam Tutorial

Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau and Frances Perry cover the basics of robust stream processing with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow.
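
For readers new to Beam, the shape of a pipeline is easy to preview. Below is a minimal word-count sketch in the Python SDK (illustrative only, not the tutorial's exercises; file paths are placeholders):

    # Minimal Apache Beam pipeline: count words in a text file.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('input.txt')        # placeholder input path
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
         | 'Count' >> beam.CombinePerKey(sum)
         | 'Write' >> beam.io.WriteToText('counts'))          # placeholder output prefix

The same pipeline code runs unmodified on Flink, Spark, or Google Cloud Dataflow by swapping the runner in the pipeline options.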

The evolution of massive-scale data processing Session

Join Tyler Akidau for a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, as Tyler compares and contrasts systems at Google with popular open source systems in use today.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Julie Steele, Stephen O'Sullivan, and Scott Kurth field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for and the evolving role of the CDO. Even if you don’t have a specific question, join in to hear what others are asking.

Office Hour with John Akred and Stephen O'Sullivan (Silicon Valley Data Science) Office Hour

Want advice on building a data platform, tool selection, or integration with legacy systems? Talk to John and Stephen.

Khalid Al-Kofahi is vice president of research and development at Thomson Reuters, where he leads a team of scientists and engineers performing applied research and custom algorithm development to help the business design and develop applications for the legal, financial, risk, and tax and accounting industries. In addition to his role as head of R&D, Khalid is leading the establishment of a Center for Cognitive Computing and AI for Thomson Reuters in Toronto, Canada. The center focuses on simplifying and transforming how knowledge tasks get done through a combination of automation, machine assistance, and task-focused natural experiences. Khalid’s research interests include natural language processing, information retrieval, machine learning, recommender systems, and computer vision. Khalid holds a PhD from Rensselaer Polytechnic Institute and an MS from Rochester Institute of Technology, both in computer engineering, and a BS in electrical engineering from Jordan University of Science and Technology.

Presentations

Becoming smarter about credible news Keynote

Data helps us understand our market in new and novel ways. In today's world, sifting through the noise in modern journalism means navigating enormous amounts of data, news, and tweets. Tom Reilly and Khalid Al-Kofahi explain how Thomson Reuters is leveraging big data and machine learning to chase down leads, verify sources, and determine what's newsworthy.

Sridhar Alla is the director of big data solutions and architecture at Comcast, where he has delivered several key solutions, such as the Xfinity personalization platform, clickthru analytics, and the correlation platform. Sridhar started his career in network appliances on NAS and caching technologies. Previously, he served as the CTO of security company eIQNetworks, where he merged the concepts of big data and security products. He holds patents on the topics of very large-scale processing algorithms and caching.

Presentations

Real-time analytics using Kudu at petabyte scale Session

Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.

Anima Anandkumar is a principal scientist at Amazon Web Services. Anima is currently on leave from UC Irvine, where she is an associate professor. Her research interests are in the areas of large-scale machine learning, nonconvex optimization, and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. Previously, she was a postdoctoral researcher at MIT and a visiting researcher at Microsoft Research New England. Anima is the recipient of several awards, including the Alfred. P. Sloan fellowship, the Microsoft faculty fellowship, the Google research award, the ARO and AFOSR Young Investigator awards, the NSF CAREER Award, the Early Career Excellence in Research Award at UCI, the Best Thesis Award from the ACM SIGMETRICS society, the IBM Fran Allen PhD fellowship, and several best paper awards. She has been featured in a number of forums, such as the Quora ML session, Huffington Post, Forbes, and O’Reilly Media. Anima holds a BTech in electrical engineering from IIT Madras and a PhD from Cornell University.

Presentations

Distributed deep learning on AWS using MXNet Session

Anima Anandkumar demonstrates how to use preconfigured Deep Learning AMIs and CloudFormation templates on AWS to help speed up deep learning development and shares use cases in computer vision and natural language processing.
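
As a taste of the API the session builds on, here is a minimal MXNet sketch in Python (illustrative only, not the session's code):

    # Define a small symbolic network with MXNet and bind it to a device.
    import mxnet as mx

    data = mx.sym.Variable('data')
    fc1  = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
    act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
    fc2  = mx.sym.FullyConnected(data=act1, num_hidden=10, name='fc2')
    net  = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

    # Use mx.gpu(0) instead on a GPU instance, e.g., one launched from a
    # Deep Learning AMI; training would then proceed via mod.fit(...).
    mod = mx.mod.Module(symbol=net, context=mx.cpu())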

Office Hour with Anima Anandkumar (UC Irvine) Office Hour

Anima hosts a lively hour of discussion about large-scale machine learning and machine learning at Amazon Web Services.

Eric Anderson leads Beachbody’s Data organization, which includes BI, data warehousing, big data, and email and CRM platforms. Eric is focused on providing business value in the shortest amount of time by leveraging a wide array of technologies to deliver outcomes quickly. He’s delivered on data initiatives for over 15 years with Accenture and Slalom Consulting and worked for a number of other technology companies, including TrueCar and Edmunds. Eric is well versed in traditional data warehousing, BI platforms, Hadoop, MPP platforms, data integration tools, and AWS infrastructure and services. He holds an MBA.

Presentations

Building data lakes in the cloud with self-service access (sponsored by Talend) Session

Eric Anderson and Shyam Konda explain how the IT team at Beachbody—the makers of P90X and CIZE—successfully ingested all their enterprise data into Amazon S3 and delivered self-service access in less than six months with Talend.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time data engineering in the cloud 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.
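
On the ingestion side, the baseline open source building block is a Kafka producer. A minimal Python sketch follows (broker address and topic name are placeholders):

    # Send JSON events to a Kafka topic with kafka-python.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')  # placeholder broker
    for i in range(1000):
        event = {'event_id': i, 'action': 'click'}
        producer.send('events', value=json.dumps(event).encode('utf-8'))
    producer.flush()  # block until buffered records are delivered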

Real-time data engineering in the cloud (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks—both open source and managed cloud services—discusses the leading cloud providers, and explains how to choose the right one for your company.

June Andrews is a data scientist at Pinterest working on enabling data-driven products and insights. Previously, she worked as technical lead of consumer analytics and staff data scientist at LinkedIn specializing in growth, engagement, and social network analysis. June’s work connected the global and local effects between LinkedIn’s professional network and individual members. She also worked on the search algorithm at Yelp, created the data analysis stack for Noom, a healthcare startup, and designed algorithms for computing the structure of large networks with John Hopcroft. June holds degrees in applied mathematics, computer science, and electrical engineering from UC Berkeley and Cornell.

Presentations

When is data science a house of cards? Replicating data science conclusions Session

An experiment at Pinterest revealed somewhat shocking results. When nine data scientists and ML engineers were asked the same constrained question, they gave nine spectacularly different answers. The implications for business are astronomical. June Andrews and Frances Haugen explore the aspects of analysis that cause differences in conclusions and offer some solutions.

André Araujo is a solutions architect with Cloudera. Previously, he was an Oracle database administrator. An experienced consultant with a deep understanding of the Hadoop stack and its components, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs. André is a methodical and keen troubleshooter who loves making things run faster.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Mark Donsky, André Araujo, Michael Yoder, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Eduardo Arino de la Rubia is chief data scientist at Domino Data Lab. Eduardo is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. He is a graduate of the MTSU Computer Science department, General Assembly’s Data Science program, and the Johns Hopkins Coursera Data Science specialization. Eduardo is currently pursuing a master’s degree in negotiation, conflict resolution, and peacebuilding from CSUDH. You can follow him on Twitter as @earino.

Presentations

Leveraging open source automated data science tools Session

The promise of the automated statistician is as old as statistics itself. Eduardo Arino de la Rubia explores the tools created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation. Along the way, Eduardo compares open source tools such as TPOT and auto-sklearn and discusses their place in the DS workflow.

Michael Armbrust is the lead developer of the Spark SQL project at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael holds a PhD from UC Berkeley, where his thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.

Presentations

Making Structured Streaming ready for production: Updates and future directions Session

Apache Spark 2.0 introduced the core APIs for Structured Streaming, a new streaming processing engine on Spark SQL. Since then, the Spark team has focused its efforts on making the engine ready for production use. Michael Armbrust and Tathagata Das outline the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases.
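
For orientation, the canonical Structured Streaming example in PySpark looks like this (a sketch of the core API only; the session covers production concerns well beyond it):

    # Streaming word count over a socket source with Structured Streaming.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

    # Lines arriving on the socket are treated as an unbounded table.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously emit updated counts to the console.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()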

Sudhanshu Arora is a software engineer at Cloudera, where he leads the development for data management and governance solutions. Previously, Sudhanshu was with the platform team at Informatica, where he helped design and implement its next-generation metadata repository.

Presentations

Big data governance for the hybrid cloud: Best practices and how-to Session

Big data needs governance. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start—especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Sudhanshu Arora share a step-by-step approach to kick-start your big data governance initiatives.

Shivnath Babu is an associate professor of computer science at Duke University, where his research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. He is also the chief scientist at Unravel Data Systems, the company he cofounded to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has received a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award. He has given talks and distinguished lectures at many research conferences and universities worldwide. Shivnath has also spoken at industry conferences, such as the Hadoop Summit.

Presentations

Deep learning for IT operations intelligence using open source tools Session

Shivnath Babu offers an introduction to using deep learning to solve complex problems in IT operations analytics. Shivnath focuses on how deep learning can derive operations insights automatically for the complex big data application stack composed of systems such as Hadoop, Spark, Cassandra, Elasticsearch, and Impala, using examples of open source tools for deep learning.

Viral Bajaria is cofounder and CTO of 6Sense, where he leads the development of 6sense’s innovative analytics and predictive platform. Every day, Viral works to realize his passion of leveling the playing field for companies of all sizes with open source data technology. Previously, Viral built the big data platform at Hulu that processed over 2.5 billion events per day. He was an early adopter of Hadoop and built and managed a cluster that stored and processed over a petabyte of data. Viral was instrumental in building the infrastructure that powered reporting, financial, and recommendations systems across web and mobile devices. In his spare time, Viral enjoys contributing to open source projects.

Presentations

Inside predictive intelligence, the powerful technology disrupting sales and marketing Session

What if companies could predict what products people will buy, how much they will buy, and when? It would be a game changer—and it’s already possible with the power of predictive intelligence. Viral Bajaria explores how BlueJeans Network was able to leverage predictive analytics to uncover buyers earlier, convert them at a 20x higher rate, and build a $33M pipeline.

Kamil Bajda-Pawlikowski is chief architect at Teradata’s Center for Hadoop in Boston. Previously, Kamil was a cofounder and chief software architect at Hadapt, a SQL-on-Hadoop company, and did graduate research at Yale University in the area of large-scale data processing, where he developed the HadoopDB project.

Presentations

Presto: Distributed SQL on anything (sponsored by Teradata) Session

Teradata joined the Presto community in 2015 and is now a leading contributor to this open source SQL engine, originally created by Facebook. Join Kamil Bajda-Pawlikowski to learn about Presto, Teradata's recent enhancements in query performance, security integrations, and ANSI SQL coverage, and its roadmap for 2017 and beyond.

Vishal Bamba is vice president of strategy and architecture at Transamerica Technology, where he leads a team focusing on innovation initiatives within the enterprise. Vishal has over 15 years of experience in distributed systems and has led many innovation projects. He has consulted and worked for several companies including Disney, Getty, Northrop, and AIG/SunAmerica. Vishal holds an MS in computer science from the University of Southern California.

Presentations

Transamerica's journey to Customer 360 and beyond Session

Vishal Bamba and Rocky Tiwari offer an overview of Transamerica's Customer 360 platform and the work done afterward to utilize this technology, including graph databases and machine learning to help create targeted segments for products and campaigns.

Dorna Bandari is a data scientist at Pinterest, where she specializes in developing new machine-learning models in a broad range of product areas, from concept creation to productionization.

Presentations

Clustering user sessions with NLP methods in complex internet applications Session

Most internet companies record a constant stream of logs as a user interacts with their application. Depending on the complexity of the application, the logs can be extremely difficult to decipher. Dorna Bandari presents a novel NLP-based method for clustering user sessions in consumer internet applications, which has proved to be extremely effective in both driving strategy and personalization.
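
The abstract doesn't detail the method, but the framing of sessions as text can be sketched with a standard baseline: treat each session's event log as a document, vectorize, and cluster. The sketch below uses TF-IDF and k-means (a generic baseline, not Pinterest's approach; the event names are invented):

    # Cluster user sessions by treating event sequences as documents.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    sessions = [
        "home_view search_query pin_click pin_save",   # invented event tokens
        "home_view pin_click pin_click pin_save",
        "login_fail login_fail password_reset",
    ]

    X = TfidfVectorizer().fit_transform(sessions)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # sessions with similar event mixes share a cluster label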

Presentations

Machine learning and microservices: A framework for next-gen applications (sponsored by MapR Technologies) Session

Machine-learning algorithms can improve predictions and optimize business operations across industry verticals, but building and scoring models still presents a significant computational challenge requiring massive training data and complex pipelines. Nitin Bandugula outlines the benefits of implementing a microservices-based architecture to support a machine-learning model-scoring workflow.

Erin K. Banks is the portfolio marketing director for big data and analytics at Dell EMC. Erin has over 20 years’ experience in the IT industry. Previously, she worked at Juniper Networks in technical marketing for the Security business unit and at VMware and EMC as an SE in the Federal division, where she focused on virtualization and security. She holds both CISSP and CISA accreditations and is an author, blogger, and avid runner. Erin holds a BS in electrical engineering.

Presentations

Ingredients to a successful data analytics project (sponsored by Dell EMC) Session

A recent study suggests that 44% of businesses are unsure what to do about big data. Erin Banks explains how big data analytics can help transform your business and ensure your data provides the greatest value to you, covering best business practices to help you achieve insights from your analytics, extract value from your data, and drive business change.

Nenshad Bardoliwalla is an executive and thought leader with a proven track record of success leading product strategy, product management, and development in business analytics. He cofounded Tidemark Systems, where he drove the market, product, and technology efforts for its next-generation analytic applications built for the cloud. Previously, Nenshad served as VP for product management, product development, and technology at SAP, where he helped craft the business analytics vision, strategy, and roadmap leading to the acquisitions of Pilot Software, OutlookSoft, and Business Objects. Prior to SAP, he helped launch Hyperion System 9 at Hyperion Solutions. Nenshad began his career at Siebel Systems working on Siebel Analytics. He is also the lead author of Driven to Perform: Risk-Aware Performance Management from Strategy Through Execution.

Presentations

When big data leads to big results (sponsored by Paxata) Session

Thousands of companies have made their initial investments into next-generation data lake architecture, and they are on the verge of generating quality business returns. Chandhu Yalla and Nenshad Bardoliwalla explain how enterprises have unlocked tangible value from their data lakes with adaptive information management and how their organizations are providing self-service to business units.

Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Before joining Amazon, Roger was in the Cloud Machine Learning group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.

Presentations

Amazon Kinesis data streaming services Session

Roger Barga offers an overview of Kinesis, Amazon’s data streaming platform, which includes Kinesis Firehose, Kinesis Analytics, and Kinesis Streams, and explains how customers have architected their applications using Kinesis services for low-latency and extreme scale.
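
For readers new to Kinesis Streams, the core producer call via boto3 is compact (a minimal sketch; the stream name and region are placeholders):

    # Put a single record onto a Kinesis stream.
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-west-2")  # placeholder region

    event = {"user_id": 42, "action": "click"}
    kinesis.put_record(
        StreamName="example-stream",         # placeholder stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),  # determines the shard assignment
    )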

Paul Barth is founder and CEO of Podium Data, creator of the industry-leading Podium data lake software platform, which is redefining enterprise data management. He has spent decades developing advanced data and analytics solutions for Fortune 100 companies and is a recognized thought leader on business-driven data strategies and best practices. Prior to founding Podium Data, Paul cofounded NewVantage Partners, a boutique consultancy advising C-level executives at leading banking, investment, and insurance firms. In his roles at Schlumberger, Thinking Machines, Epsilon, Tessera, and iXL, Paul led the discovery and development of parallel processing and machine-learning technologies to dramatically accelerate and simplify data management and analytics. Paul holds a PhD in computer science from MIT and an MS from Yale University.

Presentations

Astellas Pharma's marketing analytics data lake Tutorial

Launched in late 2015, Astellas's enterprise data lake project is taking the company on a data governance journey. Kishore Papineni offers an overview of the project, providing insights into some of the business pain points and key drivers, how it has led to organizational change, and the best practices associated with Astellas's new data governance process.

Alon Bartur brings a wealth of field experience in product management, alliances, and sales engineering to Trifacta, where, as director of product management, he works closely with customers and partners to drive the product roadmap and requirements. Prior to joining Trifacta, Alon worked at GoodData and Google.

Presentations

Beyond polarization: Data UX for a diversity of workers Session

Joe Hellerstein, Giorgio Caviglia, and Alon Bartur share their design philosophy for users and their experience designing UIs, illustrating their design principles with core elements from Trifacta, including the founding technology of predictive interaction, recent innovations like transform builder, and other developments in their core transformation experience.

Ryan Baumann is a sales and solutions engineer at Mapbox, where he integrates engineering and sales skills to help cities use open data to make better decisions and show how Mapbox tools can be used for spatial analysis and visualization. Ryan has experience building end-to-end solutions for mining and construction customers to improve safety, productivity, and availability. Previously, he worked as a solutions engineer at Caterpillar, where he built applications to make mining operations more efficient and environmentally friendly. As a former pro cyclist, Ryan loves all things outdoors. You’ll find him scouting out Marin County on his mountain bike or hiking Mt. Diablo on the weekends. Ryan has a bachelor’s degree in mechanical engineering from the University of Wisconsin-Madison.

Presentations

Transforming cities with Mapbox and open data Tutorial

Ryan Baumann explains how Mapbox Cities helps transform transportation and safety using open data, spatial analysis, and Mapbox tools.

Gil Benghiat is one of three founders of DataKitchen, a company on a mission to enable analytic teams to deliver value quickly and with high quality. Gil’s career has always been data oriented and has included positions collecting and displaying network data at AT&T Bell Laboratories (now Alcatel-Lucent), managing data at Sybase (purchased by SAP), collecting and cleaning clinical trial data at PhaseForward (IPO, then purchased by Oracle), integrating pharmaceutical sales data at LeapFrogRx (purchased by Model N), and liberating data at Solid Oak Consulting. Gil holds an MS in computer science from Stanford University and a BS in applied mathematics and biology from Brown University. He has hiked all 48 of New Hampshire’s 4,000-footers and is now working on the New England 67.

Presentations

Seven steps to high-velocity data analytics with DataOps Session

Data analysts, data scientists, and data engineers are already working on teams delivering insight and analysis, but how do you get the team to support experimentation and insight delivery without ending up in an IT versus data engineer versus data scientist war? Christopher Bergh and Gil Benghiat present the seven shocking steps to get these groups of people working together.

Christopher Bergh is a founder and head chef at DataKitchen, where, among other activities, he leads DataKitchen’s Agile Data initiative. Chris has more than 25 years of research, engineering, analytics, and executive management experience. Previously, he was regional vice president in the Revenue Management Intelligence group at Model N; was COO of LeapFrogRx, a descriptive and predictive analytics software and service provider, where he led the 2012 acquisition of LeapFrogRx by Model N; was CTO and vice president of product management at MarketSoft (now part of IBM), an innovative enterprise marketing management software vendor; developed Microsoft Passport, the predecessor to Windows Live ID, a distributed authentication system used by hundreds of millions of users today (and was awarded a US patent for his work on that project); led the technical architecture and implementation of Firefly Passport, an early leader in internet personalization and privacy acquired by Microsoft; and led the development of the first travel-related ecommerce website at NetMarket. Chris began his career at the Massachusetts Institute of Technology’s Lincoln Laboratory and NASA’s Ames Research Center, where he created software and algorithms that provided aircraft arrival optimization assistance to air traffic controllers at several major airports in the United States. Chris also served as a Peace Corps volunteer math teacher in Botswana. He holds an MS from Columbia University and a BS from the University of Wisconsin-Madison. He is an avid cyclist, hiker, and reader and the father of two teenagers.

Presentations

Seven steps to high-velocity data analytics with DataOps Session

Data analysts, data scientists, and data engineers are already working on teams delivering insight and analysis, but how do you get the team to support experimentation and insight delivery without ending up in an IT versus data engineer versus data scientist war? Christopher Bergh and Gil Benghiat present the seven shocking steps to get these groups of people working together.

Cesar Berho is a senior security researcher at Intel and a committer to the Apache Spot project. Cesar has 12 years of experience working within the cybersecurity industry in positions in operations, design, engineering, and research. Recently, he has been focusing on new ways to analyze telemetry sources with analytics and benchmarking security implementations.

Presentations

Paint the landscape and secure your data center with Apache Spot Session

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture, using unsupervised machine-learning techniques at cloud scale for anomaly detection.

Rajesh Bhargava is an engineering leader at Visa, where he currently leads the effort to offer Hadoop platform as a service (PaaS). Previously, as a director and founding member of a predictive analytics startup, he built scalable platforms for predictive analytics, segmentation, and real-time recommendations, and at Yahoo, as an engineering lead, he helped define and design a number of audience and advertiser analytics solutions. Rajesh holds a bachelor’s degree in computer science and a master’s in computer applications.

Presentations

Swipe, dip, and hover: Managing card payment data at Visa Session

Visa is transforming the way it manages data: database appliances are giving way to Hadoop and HBase, and proprietary ETL is being replaced by Spark. Nandu Jayakumar and Rajesh Bhargava discuss the adoption of big data practices at this conservative financial enterprise and contrast it with the adoption of the same ideas at Nandu's previous employer, a web and ad tech company.

Joseph Blue is a data scientist at MapR. Previously, Joe developed predictive models in healthcare for Optum (a division of UnitedHealth) as chief scientist and was the first fellow for Optum’s startup, Optum Labs. Before his time at Optum, Joe accumulated 10 years of analytics experience at LexisNexis, HNC Software, and ID Analytics (now LifeLock), specializing in business problems such as fraud and anomaly detection. He is listed on several patents.

Presentations

Applying machine learning to live patient data Session

Joseph Blue and Carol McDonald walk you through a reference application that processes ECG data encoded in HL7 with a modern anomaly detector, demonstrating how combining visualization and alerting enables healthcare professionals to improve outcomes and reduce costs, and sharing lessons learned from their experience dealing with real data in real medical situations.

Ron Bodkin is CTO for architecture and services at Teradata, where he leads the global emerging technology team focused on artificial intelligence, GPUs, and blockchain. He also leads global consulting teams that build enterprise analytics architectures combining Hadoop and Spark, the public cloud, and traditional data warehousing, a strategic pillar for Teradata.

Previously, Ron was the founding CEO of Think Big Analytics, which provides end-to-end support for enterprise big data, including data science, data engineering, advisory and managed services, and frameworks such as Kylo for enterprise data lakes. Think Big, the leading global pure-play big data services firm, was acquired by Teradata in 2014.

Before Think Big, Ron was VP of engineering at Quantcast, where he led the data science and engineering teams that pioneered the use of Hadoop and NoSQL for batch and real-time decision making. Earlier, he founded New Aspects, which provided enterprise consulting for aspect-oriented programming, and cofounded B2B applications provider C-Bridge, which, as CTO, he led to a team of 900 people and a successful IPO. Ron graduated with honors from McGill University with a BS in math and computer science and earned a master’s degree in computer science from MIT, leaving the PhD program after presenting the idea for C-Bridge and placing in the finals of the 50K Entrepreneurship Contest.

Presentations

Driving enterprise open source adoption, from data lake to AI (sponsored by Teradata) Keynote

It is no surprise that reducing operational IT expenditures and increasing software capabilities is a top priority for large enterprises. Given its advantages, open source software has proliferated across the globe. Ron Bodkin explains how Teradata drives open source adoption inside enterprises using open source data management and AI techniques leveraged across the analytical ecosystem.

James Bradbury is a research scientist at Salesforce Research, where he works on cutting-edge deep learning models for natural language processing. James joined Salesforce with the April 2016 acquisition of MetaMind Inc., where he designed and implemented a neural machine translation system that won second place in the WMT 2016 machine-translation competition. He is a contributor to the Chainer and PyTorch deep learning software frameworks. James holds a degree in linguistics from Stanford University.

Presentations

PyTorch: A flexible and intuitive framework for deep learning Session

James Bradbury offers an overview of PyTorch, a brand-new deep learning framework from developers at Facebook AI Research that's intended to be faster, easier, and more flexible than alternatives like TensorFlow. James makes the case for PyTorch, focusing on the library's advantages for natural language processing and reinforcement learning.
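
The flexibility James highlights comes from PyTorch's define-by-run model: the graph is built as ordinary Python executes, so control flow can depend on the data. A tiny sketch (illustrative, not the speaker's code):

    # Dynamic autograd: loop length depends on the data itself.
    import torch

    x = torch.ones(3, requires_grad=True)
    y = x * 2
    while y.norm() < 100:   # the graph grows with each iteration
        y = y * 2
    y.sum().backward()
    print(x.grad)           # gradients flow back through the dynamic loop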

Joseph Bradley is a software engineer working on machine learning at Databricks. Joseph is an Apache Spark committer and PMC member. Previously, he was a postdoc at UC Berkeley. Joseph holds a PhD in machine learning from Carnegie Mellon University, where he focused on scalable learning for probabilistic graphical models, examining trade-offs between computation, statistical efficiency, and parallelization.

Presentations

Best practices for deep learning on Apache Spark Session

Joseph Bradley and Tim Hunter share best practices for building deep learning pipelines with Apache Spark, covering cluster setup, data ingest, tuning clusters, and monitoring jobs—all demonstrated using Google’s TensorFlow library.
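
One pattern in this space is scoring a trained deep model inside Spark executors; the sketch below shows the general shape (an assumption on my part, not necessarily the presenters' recipe; the model path and feature shapes are hypothetical):

    # Distributed scoring of a saved Keras/TensorFlow model over a Spark RDD.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DLScoring").getOrCreate()

    def score_partition(rows):
        # Load the model once per partition, not once per record.
        import numpy as np
        import tensorflow as tf
        model = tf.keras.models.load_model("/models/example.h5")  # hypothetical path
        for row in rows:
            features = np.array(row, dtype="float32").reshape(1, -1)
            yield float(model.predict(features, verbose=0)[0][0])

    features = spark.sparkContext.parallelize([[0.1, 0.2], [0.3, 0.4]])
    print(features.mapPartitions(score_partition).collect())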

Office Hour with Joseph K. Bradley (Databricks) Office Hour

Joseph’s office hour is an excellent opportunity to discuss the best practices for building deep learning pipelines with Apache Spark, as well as any other topics about machine learning and graph processing on Spark you'd like to talk about (e.g., MLlib's current activity and roadmap, GraphFrames, and Databricks).

Matt Brandwein is director of product management at Cloudera, driving the platform’s experience for data scientists and data engineers. Before that, Matt led Cloudera’s product marketing team, with roles spanning product, solution, and partner marketing. Previously, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in computer science and mathematics from the University of Massachusetts Amherst.

Presentations

Making self-service data science a reality Session

Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine-learning and distributed-systems experience. Previously, Claudiu worked at Atigeo, building big data and data science-driven products for various customers.

Presentations

Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies Session

David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.
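
The clinical annotators in the demo are custom, but the spaCy layer they build on is simple to preview (a sketch using a generic English model, not the medical ontologies from the talk):

    # Tokenize free text and extract named entities with spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # generic model; the demo uses custom annotators
    doc = nlp("Patient was prescribed 20 mg of lisinopril for hypertension.")
    for ent in doc.ents:
        print(ent.text, ent.label_)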

Kurt Brown leads the Data Platform team at Netflix. Kurt’s group architects and manages the technical infrastructure underpinning the company’s analytics, which includes various big data technologies like Hadoop, Spark, and Presto; Netflix open sourced applications and services, such as Genie and Lipstick; and traditional BI tools, including Tableau and Redshift.

Presentations

Office Hour with Kurt Brown (Netflix) Office Hour

Stop by and chat with Kurt about anything (big) data infrastructure related.

The Netflix data platform: Now and in the future Session

The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.

Charlie Burgoyne is the principal director of data science at frog, where he guides the vision and implementation of the new data science organization and initiatives and helps frog complement its traditional process with rigorous data science. Charlie leads a team of highly trained scientists and engineers from several studios across the world to implement advanced analytics, machine learning, and artificial intelligence into frog products. Previously, Charlie held a variety of roles including director of data science at Rosetta Stone, vice president of R&D for a government contracting firm specializing in cybersecurity and machine learning, a research physicist for the DOE and NNSA, and a research astrophysicist for NASA in conjunction with George Washington University. Charlie holds a master’s degree in astrophysics from Georgetown University and a bachelor’s in nuclear physics from George Washington University. He has a real passion for languages and speaks French, German, and Italian.

Presentations

Bringing data into design: How to craft personalized user experiences Session

From personalized newsfeeds to curated playlists, users want tailored experiences when they interact with their devices. Ricky Hennessy and Charlie Burgoyne explain how frog’s interdisciplinary teams of designers, technologists, and data scientists create data-driven, personalized, and adaptive user experiences.

James Burkhart is the technical lead on real-time data infrastructure at Uber. James has a strong background in time series data storage, processing, and retrieval. Previously, he worked on Blueflood, a time series database on top of Cassandra, while at Rackspace.

Presentations

Real-time analytics at Uber scale (sponsored by MemSQL) Session

James Burkhart explains how Uber supports millions of analytical queries daily across real-time data with Apollo.

Mark Burnette is a director of sales engineering and major accounts at Pentaho, where he leads teams of engineers across the western US and Japan that focus on designing and proving out big data and embedded solutions for Fortune 500 companies, including cybersecurity, telematics, mobile network optimization, data quality as a service (DQaaS), and audience behavior analytics. Mark also founded the boutique IT consulting practice Synergetic Consulting, which has been providing custom enterprise application and data solutions since 1995.

Presentations

Five steps to a killer data lake, from ingest to machine learning (sponsored by Pentaho) Session

Mark Burnette outlines five keys to success with data lakes and explores several real-world data lake implementations that are changing the world.

Michelle Casbon is director of data science at Qordoba. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on textual datasets. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her writing has been featured in the AI section of O’Reilly Radar. Michelle holds a master’s degree from the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation.

Presentations

Machine learning to automate localization with Apache Spark and other open source tools Session

Supporting multiple locales involves the maintenance and generation of localized strings. Michelle Casbon explains how machine learning and natural language processing are applied to the underserved domain of localization using primarily open source tools, including Scala, Apache Spark, Apache Cassandra, and Apache Kafka.

Jorge Castañón hails from Mexico City and holds a PhD in computational and applied mathematics from Rice University. He has a genuine passion for data science and machine learning applications of any kind, especially imaging problems. Since 2007, he has been developing numerical optimization models and algorithms for machine learning and inverse problems. At IBM, Jorge joined the analytics team at the Silicon Valley Laboratory, where he is building the future of machine learning and text analytics tools.

Presentations

Top enterprise use cases for streaming and machine learning (sponsored by IBM) Session

Roger Rea and Jorge Castañón outline the top enterprise use cases for streaming and machine learning.

Sarah Catanzaro is an investor at Canvas Ventures, where she focuses on analytics, data infrastructure, and machine intelligence. Sarah has several years of experience in developing data acquisition strategies and leading machine and deep learning-enabled product development at organizations of various sizes. Most recently, she led the data team at Mattermark to collect and organize information on over one million private companies. Previously, she implemented analytics solutions for municipal and federal agencies as a consultant at Palantir and as an analyst at Cyveillance. She also led projects on adversary behavioral modeling and Somali pirate network analysis as a program manager at the Center for Advanced Defense Studies. Sarah holds a BA in international security studies from Stanford University.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Giorgio Caviglia is principal UX designer at Trifacta. Previously, Giorgio was part of the Center for Spatial and Textual Analysis at Stanford University, where he designed digital tools to support scholarly research in the digital humanities. Giorgio has been involved in research, consulting and teaching activities at both public and private institutions, such as Stanford University, Politecnico di Milano, ISIA Urbino, IULM, and Accurat. His research focuses on data-driven visualizations and interfaces for the humanities and social sciences. Giorgio’s work has been featured at numerous international conferences and venues, including SIGGRAPH, MIT, Harvard University, MediaLAB Prado, Expo 2010 Shanghai, and Triennale di Milano, and in publications and showcases, such as Visual Complexity, Malofiej, Data Flow, Design for Information, Fast Company, Gizmodo, Gigaom, and Wired. Giorgio holds an MSc in communication design and a PhD in design from the Politecnico di Milano, where he was part of the DensityDesign lab from its beginning until 2013.

Presentations

Beyond polarization: Data UX for a diversity of workers Session

Joe Hellerstein, Giorgio Caviglia, and Alon Bartur share their design philosophy for users and their experience designing UIs, illustrating their design principles with core elements from Trifacta, including the founding technology of predictive interaction, recent innovations like transform builder, and other developments in their core transformation experience.

Anjaneya “Reddy” Chagam is a principal engineer and chief SDS architect in Intel’s Data Center group, where he is responsible for developing software-defined storage strategy, architecture, and technology initiatives within Intel. Reddy has 20 years of software development, architecture, and systems engineering expertise in enterprise and cloud-computing environments. Previously, he worked in Intel’s Technology and Manufacturing division, where he was responsible for delivering automation solutions across eight process technology generations and was instrumental in delivering IA-based mission-critical manufacturing solutions. Reddy also worked with several Fortune 500 companies to deliver scalable data-center solutions using Intel technologies. Reddy holds a bachelor’s degree and a master’s degree in computer science and engineering.

Presentations

Ushering in a new era of hyperconverged big data: Hadoop over vSAN (cosponsored by VMware and Intel) Session

Vahid Fereydouny and Anjaneya Chagam share the results of running Hadoop workloads on a standard all-flash vSAN cluster, unleashing the simplicity and power of big data in a hyperconverged environment.

Vinoth Chandar works on data infrastructure at Uber, with a focus on Hadoop and Spark. Vinoth has a keen interest in unified architectures for data analytics and processing. Previously, he was the LinkedIn lead on Voldemort and worked on the Oracle database server's replication engine, HPC, and stream processing.

Presentations

Hoodie: Incremental processing on Hadoop at Uber Session

Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Vinoth Chandar and Prasanna Rajaperumal introduce Hoodie, a newly open sourced system at Uber that adds new incremental processing primitives to existing Hadoop technologies to provide near-real-time data at 10x reduced cost.

Office Hour with Vinoth Chandar (Uber) Office Hour

Uber has added new incremental processing primitives to existing Hadoop technologies for near-real-time use cases. Stop by Vinoth’s office hour to find out how.

Alan Chaney is chief architect and vice president of engineering at Bitvore Corp. Alan started down the technology path years before he knew anything about software. A guy who’s always been fascinated by how things work, he fed his curiosity by disassembling everything from clocks to motorcycles. When he learned that software offered the same sort of intellectual stimulation (but without the solder burns), he quickly changed his focus from rewiring objects to writing code. During his career as an academic and entrepreneur, Alan dedicated himself to reinventing everything from networked storage to streaming media, always focused on doing things better, faster, and more elegantly than previously imagined.

Presentations

Delivering relevant filtered news to save hours of drudgery each day for fixed-income securities analysts Session

Bitvore Corp’s Bitvore for Munis personalized news surveillance system is rapidly becoming a must-have for all major fixed-income securities analysts, investors, and brokers working in the three-trillion-dollar municipal bond market in the USA. Alan Chaney explains how Bitvore delivers the few important and relevant articles out of thousands each day, saving users many hours daily.

Bryan Cheng is a backend developer and analytics lead at BlockCypher. Since 2015, he has worked on infrastructure powering bitcoin and other blockchains. As analytics lead, Bryan works to combine BlockCypher’s experience with blockchains of all sizes with the latest in machine-learning and big data analytics to help governments and private industry stay informed and secure. Previously, Bryan cofounded a startup and led a network access control team at UC Berkeley, where he graduated with a BS in materials science and mechanical engineering. When not hacking in Spark or writing Golang, Bryan can be found learning Rust, riding his bike, and exploring VR.

Presentations

Spark, GraphX, and blockchains: Building a behavioral analytics platform for forensics, fraud, and finance Session

Bryan Cheng and Karen Hsu describe how they built machine-learning and graph traversal systems on Apache Spark to help government organizations and private businesses stay informed in the brave new world of blockchain technology. Bryan and Karen also share lessons learned combining these two bleeding-edge technologies and explain how these techniques can be applied to private and federated chains.

Slava Chernyak is a senior software engineer at Google. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Google Cloud Dataflow Streaming from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.

Presentations

Ask me anything: Apache Beam AMA

Join Tyler Akidau, Frances Perry, Kenneth Knowles, and Slava Chernyak to discuss anything related to Apache Beam.

Watermarks: Time and progress in Apache Beam (incubating) and beyond Session

Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications.
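
For readers new to watermarks, here is a minimal sketch in Python of how watermark-driven triggering looks in the Apache Beam model. The two-element collection and the fixed timestamp are hypothetical stand-ins; a real pipeline would read from an unbounded source:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create([("user1", 1), ("user2", 1)])  # stand-in for an unbounded source
        | beam.Map(lambda kv: window.TimestampedValue(kv, 1489500000))  # attach event times
        | beam.WindowInto(
            window.FixedWindows(60),   # one-minute event-time windows
            trigger=AfterWatermark(),  # fire when the watermark passes the window's end
            accumulation_mode=AccumulationMode.DISCARDING)
        | beam.CombinePerKey(sum))

The AfterWatermark trigger is the idea the session centers on: a window's result is emitted once the system estimates that all data for that window has arrived, even if events showed up out of order.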

Yohan Chin is the head of data science at Tapjoy, where he focuses on personalized mobile advertising, audience targeting, and in-app user LTV maximization. His team works on data science modeling, data science backend engineering, and data insight visualization. Previously, Yohan was a lead research scientist at MySpace, where he led the development of personalized music and video recommendations and several other data science products built on social networking data. He holds a PhD in computer science from the University of Texas at Dallas and a BS from Seoul National University.

Presentations

Building a real-time data science service for mobile advertising Tutorial

To ensure that users have the best application experience, Tapjoy has architected a data science service to handle ad-request optimization and personalization in real time. Robin Li shares the critical considerations for building such a Lambda architecture and details the methods Tapjoy used to evaluate and implement its real-time architecture.

Darren Chinen is the senior director of data science and engineering at Malwarebytes. Previously, Darren implemented big data solutions on Hadoop at GoPro and Apple. Living in data for almost 20 years, he cut his teeth as a data engineer on traditional data warehouse technologies and has led EDW transition projects to Hadoop as well as new big data implementations. Darren is a Capricorn. He likes long walks on the beach, candlelit dinners, and puppies.

Presentations

Building an automation-driven Lambda architecture (sponsored by BMC) Session

Darren Chinen, Sujay Kulkarni, and Manjunath Vasishta demonstrate how to use a Lambda architecture to provide real-time views into big data by combining batch and stream processing, leveraging BMC’s Control-M as a critical component of both batch processing and ecosystem management.

Rumman Chowdhury is a senior manager and AI lead at Accenture, where she works on cutting-edge applications of artificial intelligence and leads the company’s responsible and ethical AI initiatives. She also serves on the board of directors for three AI startups. Rumman’s passion lies at the intersection of artificial intelligence and humanity. She comes to data science from a quantitative social science background. She has been interviewed by Software Engineering Daily, the PHDivas podcast, German Public Television, and fashion line MM LaFleur. In 2017, she gave talks at the Global Artificial Intelligence Conference, IIA Symposium, ODSC Masterclass, and the Digital Humanities and Digital Journalism conference, among others. Rumman holds two undergraduate degrees from MIT and a master’s degree in quantitative methods of the social sciences from Columbia University. She is nearing completion of her PhD at the University of California, San Diego.

Presentations

Visualizing the history of San Francisco Session

In collaboration with the Gray Area Foundation for the Arts and Metis Data Science, Rumman Chowdhury created an interactive data art installation with the purpose of educating San Franciscans about their own city. Rumman discusses the challenges of using historical, predigital-era data with D3 and R to craft a compelling and educational story residing at the intersection of art and technology.

Ira Cohen is a cofounder of Anodot and its chief data scientist, where he is responsible for inventing and developing its real-time multivariate anomaly detection algorithms, which work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

The app trap: Why every mobile app needs anomaly detection Session

Apps have so many moving parts that a simple change to one element can cause havoc somewhere else. The resulting issues annoy users and cause revenue leaks. Ira Cohen outlines ways to use anomaly detection to monitor all areas of an app, from the code to the user behavior to partner integrations and more, to fully optimize your mobile app.
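
As a deliberately simplified illustration of the approach, here is a sketch, assuming a hypothetical per-minute latency metric, of flagging points that deviate sharply from a trailing window; the multivariate algorithms described above are far more sophisticated:

from collections import deque
import math

def detect_anomalies(values, window=60, threshold=3.0):
    # Yield (index, value) pairs that deviate more than threshold
    # standard deviations from the trailing window's mean.
    history = deque(maxlen=window)
    for i, v in enumerate(values):
        if len(history) == window:
            mean = sum(history) / window
            std = math.sqrt(sum((x - mean) ** 2 for x in history) / window)
            if std > 0 and abs(v - mean) / std > threshold:
                yield i, v
        history.append(v)

# Hypothetical metric: steady ad-request latency with one spike at minute 90.
latencies = [100 + (i % 5) for i in range(120)]
latencies[90] = 500
print(list(detect_anomalies(latencies)))  # [(90, 500)]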

Robert Cohen is a senior fellow at the Economic Strategy Institute, where he is directing a new study to examine the economic and business impacts of virtualization of compute, storage, and networking infrastructure, big data, and the internet of things—the “new IP.” Robert is formulating a series of case studies of firms that are early adopters of these technologies and would appreciate any inquiries from firms that would like to contribute to this analysis.

Presentations

The programmable enterprise: Software is central to innovation Session

Programmable enterprises are developing their businesses around cloud computing, big data, and the internet of things. Robert Cohen explores how infrastructure changes will alter corporate use of software, skilled employees, and strategies, the business and economic impacts of these changes, and the broader impacts of these shifts on our economy and society.

Christopher Colburn is just another data scientist at Netflix.

Presentations

Going real time: Creating online datasets for personalization Session

In the past, typical real-time data processing was reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more-capable hardware, the streaming context can now enable personalization applications. Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix.

Eric Colson is chief algorithms officer at Stitch Fix as well as an advisor to several big data startups. Previously, Eric was vice president of data science and engineering at Netflix. He holds a BA in economics from SFSU, an MS in information systems from GGU, and an MS in management science and engineering from Stanford.

Presentations

Office Hour with Eric Colson (Stitch Fix) Office Hour

Join Eric to learn from his experience building large data science teams at companies like Netflix and Stitch Fix.

Organizing for data science: Some unintuitive lessons learned for unlocking value Session

Data scientists blend the skills of statisticians, software engineers, and domain experts to create new roles. Data science isn't merely an amalgam of disciplines but rather a gestalt that synthesizes the ethos of various fields. This merits new thinking when it comes to organization. Eric Colson explores some novel—and often unintuitive—ways to unleash the value of your data science team.

Dustin Cote is a customer operations engineer at Confluent. Over his career, Dustin has worked in a variety of roles from Java developer to operations engineer. His most recent focus is distributed systems in the big data ecosystem, with Apache Kafka being his software of choice.

Presentations

Mistakes were made, but not by us: Lessons from a year of supporting Apache Kafka Session

Dustin Cote and Ryan Pridgeon share their experience troubleshooting Apache Kafka in production environments and discuss how to avoid pitfalls like message loss or performance degradation in your environment.

Rob Craft is the lead product manager on the Cloud Machine Learning team for Google Cloud Platform. Rob has spent his career focusing on building great products on the leading edge of what’s next, from his early days working on OS kernels, managed languages, and high-performance computing to his current role transforming the cloud with machine learning.

Presentations

Machine learning at Google (sponsored by Google) Keynote

Rob Craft shares some of the ways machine learning is being used inside of Google, explores cloud-based neural networks, and discusses some customer use cases.

Machine learning with Google Cloud Platform (sponsored by Google) Session

Rob Craft explores machine learning and predictive analytics, explaining how you can leverage the power of ML whether you have a machine-learning team of your own or just want to use ML as a service.

Charlotte Crain is a principal solutions architect in SAS’s Global Technology practice, where she works globally with SAS sales teams, product management, SAS R&D, product marketing, professional services and partners, education, and technical support. She draws on 15 years of combined experience in SAS data management methodology, architecture, and governance; statistical modeling; SAS architecture and deployments; SAS programming; and applications development to engage meaningfully with customers on enterprise decision management, deployment, and integration. Charlotte holds both a BS and an MS in mathematics with an emphasis in numerical methods and analysis and linear/nonlinear modeling.

Presentations

Outsmarting insider threats: Safeguarding your most sensitive assets (sponsored by SAS) Session

Reflecting the old horror gimmick "the call that comes from inside the house," an increasing number of data breaches are carried out by insiders. Charlotte Crain and Tyler Freckman share a unique, hybrid approach to insider threat deterrence that combines traditional detection methods and investigative methodologies with behavioral analysis to enable complete, continuous monitoring of activity.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata + Hadoop World conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Thursday keynote welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Beau Cronin is the lead developer for Embedding.js, a library for data-driven immersive environments. Beau cofounded two startups based on probabilistic inference; the second was acquired by Salesforce in 2012. Recently, he has become increasingly focused on web-based virtual reality for data visualization. Beau holds a PhD in computational neuroscience from MIT.

Presentations

Launching Pokémon GO Keynote

Pokémon GO was one of the fastest-growing games of all time, becoming a worldwide phenomenon in a matter of days. In conversation with Beau Cronin, Phil Keslin, CTO of Niantic, explains how the engineering team prepared for—and just barely survived—the experience.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynote welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Michelangelo D’Agostino is the director of data science R&D at Civis Analytics, where he leads a team that develops statistical models and writes software to help companies and nonprofits leverage their data. As a reformed particle physicist turned data scientist, Michelangelo loves mungeable datasets, machine learning, and long walks on the beach (with a floppy hat, plenty of sunscreen, and a laptop). Michelangelo came to Civis from Braintree, a Chicago-based online payments company that was acquired by PayPal. Prior to Braintree, he was a senior analyst in digital analytics with the 2012 Obama re-election campaign. He helped to optimize the campaign’s email fundraising juggernaut and analyzed social media data. Michelangelo has been a mentor with the Data Science for Social Good Fellowship. He holds a PhD in particle astrophysics from UC Berkeley and got his start in analytics sifting through neutrino data from the IceCube experiment. Accordingly, he spent two glorious months at the South Pole, where he slept in a tent salvaged from the Korean War and enjoyed the twice-weekly shower rationing. He’s also written about science and technology for the Economist.

Presentations

The power of persuasion modeling Session

How do we know that an advertisement or promotion truly drives incremental revenue? Michelangelo D'Agostino and Bill Lattner share their experience developing machine-learning techniques for predicting treatment responsiveness from randomized controlled experiments and explore the use of these “persuasion” models at scale in politics, social good, and marketing.

Shirshanka Das is the architect for LinkedIn’s Data Analytics Infrastructure team. Shirshanka was one of the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. His current focus at LinkedIn includes all things Hadoop, high-performance distributed OLAP engines, large-scale data ingestion, transformation and movement, and data lineage and discovery.

Presentations

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn Session

Shirshanka Das and Yael Garten share best practices learned using Kafka and Hadoop as the foundation of a petabyte-scale data warehouse at LinkedIn, offering concrete suggestions to help you process data seamlessly. Along the way, Shirshanka and Yael discuss their experience running governance to empower data teams.

Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and is currently employed at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research about data-center frameworks and networks with Scott Shenker and Ion Stoica.

Presentations

Making Structured Streaming ready for production: Updates and future directions Session

Apache Spark 2.0 introduced the core APIs for Structured Streaming, a new stream processing engine built on Spark SQL. Since then, the Spark team has focused its efforts on making the engine ready for production use. Michael Armbrust and Tathagata Das outline the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases.
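
As a taste of what "stream processing built on Spark SQL" means in practice, here is a minimal sketch in Python; the socket source and port are toy stand-ins for a real input stream (start one locally with nc -lk 9999):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Read an unbounded stream of lines from a local socket.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame operations used in batch queries define the streaming
# computation; the engine runs it incrementally as new lines arrive.
counts = lines.groupBy("value").count()

query = (counts.writeStream
         .outputMode("complete")  # emit the full updated aggregate on each trigger
         .format("console")
         .start())
query.awaitTermination()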

Prior to joining Amplify as a general partner, Mike Dauber spent over six years at Battery Ventures, where he led early-stage enterprise investments on the West Coast, including Battery’s investment in a stealth security company that is also in Amplify’s portfolio. Most recently, Mike sat on the boards of Continuuity, Duetto, Interana, and Platfora. Mike previously invested in Splunk and RelateIQ, which was recently acquired by Salesforce. Mike began his career as a hardware engineer at a startup and later held product, business development, and sales roles at Altera and Xilinx. Mike is a frequent speaker at conferences and is on the advisory board of both the O’Reilly Strata Conference and SXSW. He was named to Forbes magazine’s 2015 Midas Brink List. Mike holds a BS in electrical engineering from the University of Michigan in Ann Arbor and an MBA from the University of Pennsylvania’s Wharton School.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Monal Daxini is an engineering manager at Netflix, where he is building a scalable and multitenant event processing pipeline and leads the infrastructure for stream processing as a service. He has worked on Netflix’s Cassandra and Dynamite infrastructure and was instrumental in developing the encoding compute infrastructure for all Netflix content. Monal has 15 years of experience building distributed systems at organizations like Netflix, NFL.com, and Cisco.

Presentations

Going real time: Creating online datasets for personalization Session

In the past, typical real-time data processing was reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more-capable hardware, the streaming context can now enable personalization applications. Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix.

Netflix Keystone SPaaS: Real-time stream processing as a service Session

Netflix Keystone processes over a trillion events per day with at-least-once processing semantics in the cloud. Monal Daxini explores what it means to offer stream processing as a service (SPaaS), how Netflix implemented a scalable, fault-tolerant multitenant SPaaS internal offering, and how it evolved the system in flight with no downtime.

Gabriela de Queiroz is a data scientist at Sharethrough, where she develops statistical models from concept creation to production; designs, runs, and analyzes experiments; and employs a variety of techniques to derive insights and drive data-centric decisions. Gabriela is the founder of R-Ladies, an organization created to promote diversity in the R community, which now has over 25 chapters worldwide. Currently, she is developing an online course on machine learning in partnership with DataCamp.

Presentations

Stats: What you need to know Tutorial

Data science is not only about machine learning. To be a successful data person, you also need a significant understanding of statistics. Gabriela de Queiroz walks you through the top five statistical concepts you need to know to work with data.

Danielle Dean is a senior data scientist lead at Microsoft in the Algorithms and Data Science group within Cloud and Enterprise, where she leads a team of data scientists and engineers on end-to-end analytics projects using Microsoft’s Cortana Intelligence Suite—from automating data ingestion to analyzing and implementing algorithms, exposing those implementations as web services, and integrating them into customer solutions or end-user dashboards and visualizations. Danielle holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill, where she studied the application of multilevel event history models to understand the timing and processes leading to events between dyads within social networks.

Presentations

Using big data, the cloud, and AI to enable intelligence at scale (sponsored by Microsoft) Session

Wee Hyong Tok and Danielle Dean explain how the global, trusted, and hybrid Microsoft platform can enable you to do intelligence at scale, describing real-life applications where big data, the cloud, and AI are making a difference and how this is accelerating the digital transformation for these organizations at a lightning pace.

Bill Dentinger is the vice president of products for Ryft.

Presentations

Delivering fast, simple cloud-based data analytics: Leveraging heterogeneous compute in any architecture (sponsored by Ryft Ex) Session

The massive shift of data to the cloud is exacerbating data preparation and transport complexities that slow data analytics to a crawl. Bill Dentinger explains how the deployment of FPGA/x86-based heterogeneous compute architectures by cloud vendors is giving all organizations the opportunity to speed their data analytics to unprecedented levels.

Renee DiResta is the vice president of business development at Haven, a private marketplace for booking ocean freight shipments. Previously, Renee was a principal at seed-stage VC fund O’Reilly AlphaTech Ventures (OATV) and spent seven years as a trader at Jane Street Capital, a quantitative proprietary trading firm in New York City. Renee is interested in improving liquidity and transparency in private markets and enjoys investing in and advising hardware startups.

Presentations

How the shipping industry can become more data driven Tutorial

Data is transforming global trade. Using examples from historical trade and their work at Haven, Renee DiResta and Coco Krumme explore three frictions in logistics and container shipping—price opacity, inefficient markets, and unstructured data—and identify the important ways in which data will change how we price and exchange goods worldwide.

Gillian Docherty is the CEO of The Data Lab, one of eight Innovation Centres across Scotland, where she is responsible for delivering the strategic vision set out by The Data Lab Board, the aim of which is to create over 250 new jobs and generate more than £100 million for the economy. Gillian has over 22 years’ experience working in the IT sector. Previously, she held a range of senior leadership roles at IBM UK, including leader for the software business in Scotland, systems and technology sales leader, and territory leader for general business in Scotland. Gillian is on the board of Tech Partnership Scotland and is also a board member of the Glasgow Chamber of Commerce. Gillian holds a degree in computing science from Glasgow University. She is married and has a daughter.

Presentations

Data-driven innovation Session

Gillian Docherty shares her experience leading The Data Lab, an innovation center focused on helping organizations drive economic and social benefit through data science and analytics. Along the way, Gillian discusses some of the projects her teams have supported, from multinationals to startups, and explains how they leverage academic capability to help drive innovation from data.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that cut millions of dollars’ worth of greenhouse gas emissions annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Mark Donsky, André Araujo, Michael Yoder, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Big data governance for the hybrid cloud: Best practices and how-to Session

Big data needs governance. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start—especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Sudhanshu Arora share a step-by-step approach to kick-start your big data governance initiatives.

Peng Du is a senior software engineer at Uber. He holds a PhD in computer science and an MA in applied mathematics, both from the University of California, San Diego.

Presentations

Uber's data science workbench Session

Peng Du and Randy Wei offer an overview of Uber’s data science workbench, which provides a central platform for data scientists to perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is seamlessly integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring.

Ted Dunning has been involved with a number of startups—the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member for the Apache ZooKeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data mining and passionate about open source, he is an active participant in Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

Tensor abuse in the workplace Session

Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).

Turning the internet upside down: Driving big data right to the edge (sponsored by MapR) Keynote

The internet of things is turning the internet upside down, and the effects are causing all kinds of problems. We have to answer questions about how to have data where we want it and computation where we need it—and we have to coordinate and control all of this while maintaining visibility and security. Ted Dunning shares solutions for this problem from across multiple industries and businesses.

Mike Dusenberry is an engineer at the IBM Spark Technology Center, where he is creating a deep learning library for SystemML and working to make deep learning performant at scale. Mike was on his way to an MD and a career as a physician in his home state of North Carolina when he teamed up with professors on a medical machine-learning research project. Two years later in San Francisco, Mike is contributing to Apache SystemML as a committer and researching medical applications for deep learning.

Presentations

Leveraging deep learning to predict breast cancer proliferation scores with Apache Spark and Apache SystemML Session

Estimating the growth rate of tumors is a very important but very expensive and time-consuming part of diagnosing and treating breast cancer. Michael Dusenberry and Frederick Reiss describe how to use deep learning with Apache Spark and Apache SystemML to automate this critical image classification task.

Michael Eacrett is vice president of product management leading SAP’s in-memory distributed computing platform, SAP HANA Vora, big data and IoT platforms, and enterprise information management products, where he defines product strategy, manages product requirements and the partner ecosystem, and enables product go-to-market. Most recently, Michael built up and led the SAP HANA product management team, developing the HANA product and business from initial launch to over $2 billion in product sales. Michael has over 20 years of industry experience in product management, strategic consulting, and implementation in both America and Europe.

Presentations

Modernizing business processes with big data: Real-world use cases for production (sponsored by SAP) Session

Ken Tsai and Michael Eacrett explore critical components of enterprise production environments that support day-to-day business processes while ensuring security, governance, and operational administration and share best practices to ensure business value.

Office Hour with Ken Tsai and Michael Eacrett (SAP) Office Hour

If you need your enterprise production environment to not only support day-to-day business processes but also ensure security, governance, and operational administration, Ken and Michael can offer tips, tricks, and best practices and answer your questions.

Joey Echeverria is the director of engineering at Rocana, where he builds applications that scale IT operations on the Apache Hadoop platform. Joey is a committer on the Kite SDK, an Apache-licensed data API for the Hadoop ecosystem. Previously, Joey was a software engineer at Cloudera, where he contributed to several ASF projects, including Apache Flume, Apache Sqoop, Apache Hadoop, and Apache HBase. Joey is also a coauthor of Hadoop Security, published by O’Reilly Media.

Presentations

Debugging Apache Spark Session

Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
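
To make the lazy-evaluation point concrete, here is a small PySpark sketch (the bad record is contrived) showing how an error surfaces only at an action, far from the transformation that caused it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-debug-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(["1", "2", "oops"])

# Nothing runs here: map() is a lazy transformation, so the bad record
# does not fail yet.
parsed = rdd.map(lambda s: int(s))

# The failure only appears when an action forces evaluation, which is why
# Spark stack traces often point somewhere other than the buggy code.
try:
    parsed.collect()
except Exception as e:
    print("Failure surfaced at the action, not the transformation:", type(e).__name__)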

Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative that ingests, streams, transforms, stores, and analyzes big data in near real time. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project Center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.

Presentations

Data integration and governance for big data with Apache Avro; or, How to solve the GIGO problem Tutorial

Big data famously enables anyone to contribute to the enterprise data store. Integrating previously siloed data can uncover powerful insights for the business. But without data governance, inefficiencies and incorrect business decisions may result. Barbara Eckman explains how Comcast is using Apache Avro for enterprise data governance, the challenges faced, and methods to address these challenges.

Michael Edwards’ idea of full stack developer extends from interaction design in big data analytics systems down to clock/data recovery in backscatter-modulated RF protocols. He’s all about scale-up, cost-down, with additional areas of focus in authentication/access control and easy-to-integrate data visualization components.

Presentations

Operating Kafka at petabyte scale Session

Michael Edwards shares experiences from operating several Kafka clusters in a real-time streaming event ingestion pathway, discussing lessons learned from working with hundreds of terabytes flowing through every day, petabytes of retention, and gigabytes of historical data streaming to and from storage.

Laura Eisenhardt is EVP at iKnow Solutions Europe and the founder of DigitalConscience.org, a CSR platform designed to create opportunities for technical resources (specifically expats) to give back to communities with their unique skills while making a huge impact locally. Laura has led massive programs for the World Health Organization across Africa, collecting big data in over 165 languages, and specializes in data quality and consistency. Laura is also COO for the American Institute of Minimally Invasive Heart Surgery (AIMHS.org), a nonprofit designed to educate the public and heart surgeons worldwide on how to do open heart surgery without splitting open the chest. Why? People who have complex heart surgery in a minimally invasive procedure return to work in two weeks versus 9–12 months, which has a substantial impact on society, family finances, depression, and cost for all.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

Stephen Elston is an experienced big data geek, data scientist, and software business leader. Steve is principal consultant at Quantia Analytics, LLC, where he leads the building of new business lines, manages P&L, and takes software products from concept and financing through development, intellectual property protection, sales, customer shipment, and support. Steve is also an instructor for the University of Washington data science program. Steve has over two decades of experience in visualization, predictive analytics, and machine learning at scales from small to massive, using many platforms, including Hadoop, Spark, R, S/SPLUS, and Python. He has created solutions in fraud detection, capital markets, wireless systems, law enforcement, and streaming analytics for the IoT.

Presentations

Exploration and visualization of large, complex datasets with R, Hadoop, and Spark Tutorial

Divide and recombine techniques provide scalable methods for exploration and visualization of otherwise intractable datasets. Stephen Elston and Ryan Hafen lead a series of hands-on exercises to help you develop skills in exploration and visualization of large, complex datasets using R, Hadoop, and Spark.

Susan Eraly is a software engineer at Skymind, where she contributes to Deeplearning4j. Previously, Susan worked as a senior ASIC engineer at NVIDIA and as a data scientist in residence at Galvanize.

Presentations

Scalable deep learning for the enterprise with DL4J Tutorial

Dave Kale, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets.

Yuliya Feldman recently joined Dremio Corporation as a principal software engineer, where she works on a number of products and features. Previously, Yuliya was at MapR, working on MapR admin infrastructure, the MapReduce framework, and YARN. Prior to that, Yuliya worked in diverse areas of satellite image processing, medical robotics, satellite phone communication, ecommerce, and big data.

Presentations

Pluggable security in Hadoop Session

Security will always be very important in the world of big data, but the choices today mostly start with Kerberos. Does that mean setting up security is always going to be painful? What if your company standardizes on other security alternatives? What if you want to have the freedom to decide what security type to support? Yuliya Feldman and Bill O’Donnell discuss your options.

Vahid Fereydouny is part of the vSAN product team at VMware, where he focuses on driving the vision and roadmap for the product and helping scale the business.

Presentations

Ushering in a new era of hyperconverged big data: Hadoop over vSAN (cosponsored by VMware and Intel) Session

Vahid Fereydouny and Anjaneya Chagam share the results of running Hadoop workloads on a standard all-flash vSAN cluster, unleashing the simplicity and power of big data in a hyperconverged environment.

Jake Flomenberg is a partner at Accel, where he focuses on next-generation infrastructure, enterprise software, and security investments. Jake is part of the team responsible for Accel’s Big Data Fund and led investments in Demisto, Origami Logic, Sumo Logic, Trifacta, and Zoomdata. Previously, Jake was director of product management at Splunk, where he was responsible for the product’s user interface and big data strategy; worked at Cloudera, where he helped the founding team tackle a broad array of sales, marketing, and product issues; and was a member of Lockheed Martin’s Engineering Leadership Development Program. Jake hails from Cherry Hill, New Jersey. He holds a bachelor’s degree from Duke University, a master’s degree in engineering from the University of Pennsylvania, and an MBA from Harvard Business School.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Avrilia Floratau is a senior scientist at Microsoft’s Cloud and Information Services Lab, where her research is focused on scalable real-time stream processing systems. She is also an active contributor to Heron, collaborating with Twitter. Previously, Avrilia was a research scientist at IBM Research working on SQL-on-Hadoop systems. She holds a PhD in data management from the University of Wisconsin-Madison.

Presentations

From rivulets to rivers: Elastic stream processing in Heron Session

Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.

Valentine Fontama is a principal data scientist manager on Microsoft’s Analytics + Insights Data Science team that delivers analytics capabilities across Azure and C+E cloud services. Previously, he was a new technology consultant at Equifax in London, where he pioneered the use of data mining to improve risk assessment and marketing in the consumer credit industry; principal data scientist in the Data & Decision Sciences Group (DDSG), where he led consulting to external customers, including ThyssenKrupp and Dell; and a senior product manager for big data and predictive analytics in cloud and enterprise marketing at Microsoft, where he led product management for Azure Machine Learning, HDInsight, Parallel Data Warehouse (Microsoft’s first ever data warehouse appliance), and three releases of Fast Track Data Warehouse. He has published 11 academic papers and coauthored three books on big data: Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes (2 editions) and Introducing Microsoft Azure HDInsight. Val holds an MBA in strategic management and marketing from the Wharton School, a PhD in neural networks, an MS in computing, and a BS in mathematics and electronics.

Presentations

How Microsoft predicts churn of cloud customers using deep learning and explains those predictions in an interpretable way Session

Although deep learning has proved to be very powerful, few results are reported on its application to business-focused problems. Feng Zhu and Val Fontama explore how Microsoft built a deep learning-based churn predictive model and demonstrate how to explain the predictions using LIME—a novel algorithm published in KDD 2016—to make the black box models more transparent and accessible.
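
For readers unfamiliar with LIME, here is a minimal sketch using the open source lime package; the churn features, labels, and model below are hypothetical stand-ins, not Microsoft's actual pipeline:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.RandomState(0)
X = rng.rand(500, 4)                       # hypothetical usage features
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)  # hypothetical churn labels

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["logins", "tickets", "tenure", "spend"],
    class_names=["retained", "churned"])

# LIME fits a simple, interpretable model around one instance to show which
# features pushed this customer's prediction toward "churned."
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())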

Rodrigo Fontecilla is vice president and global lead for analytics for Unisys Federal Systems, where he leads all aspects of software development, system integration, mobile development, and data management focused on the federal government. Rod is responsible for providing leadership, coordination, and oversight on all IT solutions, emerging technologies, and IT services delivery to the federal government. He has more than 25 years of professional experience in the capture, design, development, implementation, and management of information management systems delivering mission-critical IT solutions and has an extensive background and expertise in cloud computing, mobile development, social media, enterprise architecture, data analytics, SOA-based solutions, and IT governance.

Presentations

Machine-learning opportunities within the airline industry Tutorial

Rodrigo Fontecilla explains how many of the largest airlines use different classes of machine-learning algorithms to create robust and reusable predictive models that provide a holistic view of operations and deliver business value.

Eugene Fratkin is a director of engineering at Cloudera leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

Deploying and operating big data analytic apps on the public cloud Tutorial

Jennifer Wu, Eugene Fratkin, Andrei Savu, and Tony Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Practical considerations for running Spark workloads in the cloud Session

Both Spark workloads and use of the public cloud have been rapidly gaining adoption in mainstream enterprises. Anand Iyer and Eugene Fratkin discuss new developments in Spark and provide an in-depth discussion on the intersection between the latest Spark and cloud technologies.

Michael J. Freedman is a professor in the Computer Science department at Princeton University as well as the cofounder and CTO of Timescale, which provides an open source time series database optimized for fast ingest and complex queries. His research broadly focuses on distributed systems, networking, and security. He developed and operates several self-managing systems, including CoralCDN (a decentralized content distribution network) and DONAR (a server resolution system that powered the FCC’s Consumer Broadband Test), both of which serve millions of users daily. Michael’s other research has included software-defined and service-centric networking, cloud storage and data management, untrusted cloud services, fault-tolerant distributed systems, virtual world systems, peer-to-peer systems, and various privacy-enhancing and anticensorship systems. Michael’s work on IP geolocation and intelligence led him to cofound Illuminics Systems, which was acquired by Quova (now part of Neustar). His work on programmable enterprise networking (Ethane) helped form the basis for the OpenFlow/software-defined networking (SDN) architecture. His honors include the Presidential Early Career Award for Scientists and Engineers (PECASE), a Sloan fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award publications. Michael holds a PhD in computer science from NYU’s Courant Institute and both an SB and an MEng degree from MIT.

Presentations

Designing a time series database to support IoT workloads Session

IoT applications often need more-complex queries than those supported by traditional time series databases. Michael Freedman outlines a new distributed time series database for such workloads, supporting efficient queries, including complex predicates across many metrics, while scaling out to support IoT ingest rates.

Office Hour with Michael Freedman (Timescale | Princeton University) Office Hour

Stop by and talk to Michael if you have questions about time series data or the new distributed time series database TimescaleDB.

Eric Frenkiel is the cofounder and CEO of MemSQL, maker of an in-memory distributed database that combines real-time and historical big data analytics. MemSQL is a Y Combinator company that has raised more than $45M in venture capital. Prior to MemSQL, Eric worked at Facebook on partnership development. He has worked in various engineering and sales engineering capacities at both consumer and enterprise startups. Eric is a graduate of Stanford University’s School of Engineering. In 2011 and 2012, Eric was named to Forbes’s 30 under 30 list of technology innovators.

Presentations

Machines and the magic of fast learning (sponsored by MemSQL) Keynote

Eric Frenkiel explains how to use real-time data as a vehicle for operationalizing machine-learning models by leveraging MemSQL, exploring advanced tools, including TensorFlow, Apache Spark, and Apache Kafka, and compelling use cases demonstrating the power of machine learning to effect positive change.

Ellen Friedman is a solutions consultant, scientist, and O’Reilly author currently writing about a variety of open source and big data topics. Ellen is a committer on the Apache Drill and Mahout projects. With a PhD in biochemistry and years of work writing on a variety of scientific and computing topics, she is an experienced communicator. Ellen is coauthor of Streaming Architecture, the Practical Machine Learning series from O’Reilly, Time Series Databases, and her newest title, Introduction to Apache Flink. She’s also coauthor of a book of magic-themed cartoons, A Rabbit Under the Hat. Ellen has been an invited speaker at Strata + Hadoop World in London, Berlin Buzzwords, the University of Sheffield Methods Institute, and the Philly ETE conference and a keynote speaker for NoSQL Matters 2014 in Barcelona.

Presentations

Why stream? The advantages of working with streaming data Tutorial

Life doesn’t happen in batches. Being able to work with data from continuous events as data streams is a better fit to the way life happens, but doing so presents some challenges. Ellen Friedman examines the advantages and issues involved in working with streaming data, takes a look at emerging technologies for streaming, and describes best practices for this style of work.

Ajit Gaddam is chief security architect at Visa. Ajit is a technologist, serial entrepreneur, and security expert specializing in machine learning, cryptography, big data security, and cybersecurity issues. Over the last decade, Ajit has held senior roles at various tech and financial firms and founded two startups. He is an active participant in various open source and security architecture standards bodies. As a well-known security expert and industry veteran, he has authored numerous articles and white papers for publication and is a frequent speaker at high-profile conferences such as BlackHat, Strata + Hadoop World, and SABSA World Congress. He holds multiple patents in data security and other disruptive technologies.

Presentations

End-to-end security for Kafka, Spark ML, and Hadoop Session

Apache Kafka is used by over 35% of Fortune 500 companies to store and process some of their most sensitive datasets. Ajit Gaddam and Jiphun Satapathy provide a security reference architecture to secure your Kafka cluster while leveraging it to support your organization's cybersecurity requirements.
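
As one small, hedged illustration of what a locked-down cluster looks like from the client side, here is a TLS-authenticated consumer using the kafka-python client; the broker address, certificate paths, and topic are hypothetical placeholders, and the reference architecture discussed in the session goes well beyond transport security:

from kafka import KafkaConsumer

# Connect to a TLS-secured Kafka cluster: the client presents its own
# certificate (mutual TLS) and verifies the broker against the CA cert.
consumer = KafkaConsumer(
    "payments",  # hypothetical sensitive topic
    bootstrap_servers="broker1.example.com:9093",
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/certs/ca.pem",
    ssl_certfile="/etc/kafka/certs/client.pem",
    ssl_keyfile="/etc/kafka/certs/client.key")

for message in consumer:
    print(message.topic, len(message.value))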

Sriram Ganesan is a member of the technical staff at Qubole, where he works on HBase and cluster orchestration. Previously, Sriram was at Directi, where he worked on scaling the backend of leading chat app Talk.to. Sriram holds a bachelor’s degree in computer science engineering from the National Institute of Technology, Trichy, India.

Presentations

Moving big data as a service to a multicloud world Session

Qubole started out by offering Hadoop as a service in AWS. Over time, it extended its big data capabilities beyond Hadoop and its cloud infrastructure support beyond AWS. Sriram Ganesan and Prakhar Jain explain how and why Qubole built Cloudman, a simple, cloud-agnostic, multipurpose provisioning tool that can be extended for further engines and further cloud support.

Yael Garten leads a team of data scientists at LinkedIn that focuses on understanding and increasing growth and engagement of LinkedIn’s 400 million members across mobile and desktop consumer products. Yael is an expert at converting data into actionable product and business insights that impact strategy. Her team partners with product, engineering, design, and marketing to optimize the LinkedIn user experience, creating powerful data-driven products to help LinkedIn’s members be productive and successful. Yael champions data quality at LinkedIn; she has devised organizational best practices for data quality and developed internal data tools to democratize data within the company. Yael also advises companies on informatics methodologies to transform high-throughput data into insights and is a frequent conference speaker. She holds a PhD in biomedical informatics from the Stanford University School of Medicine, where her research focused on information extraction via natural language processing to understand how human genetic variations impact drug response, and an MSc from the Weizmann Institute of Science in Israel.

Presentations

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn Session

Shirshanka Das and Yael Garten share best practices learned using Kafka and Hadoop as the foundation of a petabyte-scale data warehouse at LinkedIn, offering concrete suggestions to help you process data seamlessly. Along the way, Shirshanka and Yael discuss their experience running governance to empower data teams.

How we work: The unspoken challenges of doing data science Session

Data science is a rewarding career. It's also really hard—not just the technical work itself but also "how to do the work well" in an organization. Yael Garten explores what data scientists do, how they fit into the broader company organization, and how they can excel at their trade and shares the hard and soft skills required, tips and tricks for success, and challenges to watch out for.

Office Hour with Yael Garten (LinkedIn) Office Hour

Curious about LinkedIn's experience with using Kafka and Hadoop at scale to run a data-driven company—or how you could use what they've learned? Stop by and talk to Yael.

Tim Gasper is director of product and marketing at Bitfusion, a deep learning automation software company enabling easier, faster development of AI applications, and cofounder of Ponos, an IoT-enabled hydroponics farming technology company. Tim has over eight years of big data, IoT, and enterprise content product management and product marketing experience. He is a writer and speaker on entrepreneurship, the Lean Startup methodology, and big data analytics. Previously, Tim was global portfolio manager for CSC Big Data and Analytics, where he was responsible for the overall strategy, roadmap, partnerships, and technology mix for the big data and analytics product portfolio; VP of product at Infochimps (acquired by CSC), where he led product development for its market-leading open data marketplace and big data platform as a service; and cofounder of Keepstream, a social media analytics and curation company.

Presentations

Robot farmers and chefs: In the field and in your kitchen Session

Food production and preparation have always been labor and capital intensive, but with the internet of things, low-cost sensors, cloud-computing ubiquity, and big data analysis, farmers and chefs are being replaced with connected, big data robots—not just in the field but also in your kitchen. Tim Gasper explores the tech stack, data science techniques, and use cases driving this revolution.

Bas Geerdink is a programmer, scientist, and IT manager at ING, where he is responsible for the fast data systems that process and analyze streaming data. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Building a streaming analytics solution to provide real-time actionable insights to customers Tutorial

ING is a data-driven enterprise that is heavily investing in big data, analytics, and streaming processing. Bas Geerdink offers an overview of ING's streaming analytics solution for providing actionable insights to customers, built with a combination of open source technologies, including Kafka, Flink, and Cassandra.

Scott Gnau is the CTO of Hortonworks, a company at the forefront of emerging connected data platforms, where he works intimately with leaders in the Fortune 1000 undergoing business transformation through real-time data. Scott has spent his entire career in the data industry; previously, he was president of Teradata Labs, where he provided visionary direction for research, development, and sales support activities related to Teradata integrated data warehousing, big data analytics, and associated solutions. He also drove the investments and acquisitions in Teradata’s technology related to the solutions from Teradata Labs. Scott holds a BSEE from Drexel University.

Presentations

Big data adoption trends and use cases (sponsored by Hortonworks) Session

Big data is moving from science projects to mainstream, mission-critical deployments. Drawing on his interactions and conversations with business and IT leaders across the world, Scott Gnau outlines adoption trends and popular use cases.

David Goodman is CIO in residence at NetHope, where he focuses on bringing both technical and thought leadership to bear on the implementation of NetHope’s strategic plan. Toward that end, he works on developing strategies to bolster NetHope’s relationship with the technology sector and works with the NetHope leadership team to ensure NetHope activities are properly oriented toward its key constituents, CIOs, and technology leaders in the global development sector. Previously, David served as the CIO of the International Rescue Committee, where he had global responsibility for all technology-related activities and oversaw teams focused on infrastructure, application development, user services, and project management.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

Felix Gorodishter is a software architect at GoDaddy. Felix is a web developer, technologist, entrepreneur, husband, and daddy.

Presentations

Big data for operational insights Session

GoDaddy ingests and analyzes logs, metrics, and events at a rate of 100,000 events per second. Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of more than 10 TB of daily growth for operational insights into its cloud, leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.

Bill Graham is a staff engineer on the Real Time Compute team at Twitter. Bill’s primary areas of focus are data processing applications and analytics infrastructure. Previously, he was a principal engineer at CBS Interactive and CNET Networks, where he worked on ad targeting and content publishing infrastructure, and a senior engineer at Logitech focusing on webcam streaming and messaging applications. Bill contributes to a number of open source projects, including HBase, Hive, and Presto, and he’s a Heron and Pig committer.

Presentations

From rivulets to rivers: Elastic stream processing in Heron Session

Using Heron, an open source streaming engine tailored for large-scale environments, Twitter processes billions of events per day the instant the data is generated. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.

Arthur Grava is the big data team leader at Luizalabs, where he works closely with the company’s recommender system and focuses on machine learning with Hadoop, Java, Cassandra, and Python. Arthur holds a master’s degree in recommender systems from USP.

Presentations

Building a recommender from a big behavior graph over Cassandra Session

Gleicon Moraes and Arthur Grava share war stories about developing and deploying a cloud-based large-scale recommender system for a top-three Brazilian ecommerce company. The system, which uses Cassandra and graph traversal, led to a more than 15% increase in sales.

Jonathan Gray is the founder and CEO of Cask. Jonathan is an entrepreneur and software engineer with a background in startups, open source, and all things data. Prior to founding Cask, he was a software engineer at Facebook, where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production. An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase. He is now a core contributor and active committer in the community. Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.

Presentations

Fixing what’s broken: Big data in the enterprise (sponsored by Cask) Session

Hadoop and Spark provide scale and flexibility at a low cost compared to data warehouses, but the messy and diverse nature of big data results in undesirable complexities and inefficiencies. Jonathan Gray explores the standardization, automation, and deep integration technologies that allow users to focus on application logic and insights rather than infrastructure and integration.

Ishmeet Grewal is a senior research analyst at Accenture Technology Labs, where he is the lead developer responsible for developing and prototyping a comprehensive strategy for automated analytics at scale. Ishmeet has traveled to 25 countries and likes to climb rocks in his free time.

Presentations

DevOps for models: How to manage millions of models in production Session

As Accenture scaled to millions of predictive models, it needed automation to manage models at scale, ensure accuracy, prevent false alarms, and preserve trust as models are created, tested, and deployed into production. Teresa Tung, Jürgen Weichenberger, and Ishmeet Grewal share their approach to implementing DevOps for models and employing a self-healing approach to model lifecycle management.

Jamie Grier is director of applications engineering at data Artisans, where he helps others realize the potential of Apache Flink in their own projects. Jamie has been working on stream processing for the last decade at companies such as Twitter, Gnip, and Boulder Imaging on projects spanning everything from ultra-high-performance video stream processing to social media analytics.

Presentations

Apache Flink: The latest and greatest Session

Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics include queryable state, dynamic scaling, streaming SQL, very large state support, and whatever is the latest and greatest in March 2017.

Robert Grossman is a faculty member and the chief research informatics officer in the Biological Sciences Division of the University of Chicago. Robert is the director of the Center for Data Intensive Science (CDIS) and a senior fellow at both the Computation Institute (CI) and the Institute for Genomics and Systems Biology (IGSB). He is also the founder and a partner of the Open Data Group, which specializes in building predictive models over big data. Robert has led the development of open source software tools for analyzing big data (Augustus), distributed computing (Sector), and high-performance networking (UDT). In 1996, he founded Magnify, Inc., which provides data-mining solutions to the insurance industry and was sold to ChoicePoint in 2005. He is also the chair of the Open Cloud Consortium, a not-for-profit that supports the research community by operating cloud infrastructure, such as the Open Science Data Cloud. He blogs occasionally about big data, data science, and data engineering at Rgrossman.com.

Presentations

The dangers of statistical significance when studying weak effects in big data: From natural experiments to p-hacking Session

When there is a strong signal in a large dataset, many machine-learning algorithms will find it. On the other hand, when the effect is weak and the data is large, there are many ways to discover an effect that is in fact nothing more than noise. Robert Grossman shares best practices so that you will not be accused of p-hacking.
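
To see why weak effects plus many comparisons produce spurious findings, consider a minimal simulation (an illustrative sketch, not material from the session): when hundreds of pure-noise features are each tested against an unrelated outcome, roughly 5% will appear significant at p < 0.05 by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_rows, n_features = 10_000, 200          # "big" data, all pure noise
X = rng.normal(size=(n_rows, n_features))
y = rng.normal(size=n_rows)               # outcome independent of every feature

# Correlate each feature with the outcome; pearsonr returns (r, p-value).
p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])

print((p_values < 0.05).sum(), "of", n_features,
      "noise features look 'significant' at p < 0.05")
# Expect roughly 10 false discoveries; a multiple-comparison correction
# such as Bonferroni (0.05 / 200) eliminates nearly all of them.
```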

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry, and he has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on big data topics at national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Carlos Guestrin is the director of machine learning at Apple and the Amazon Professor of Machine Learning in Computer Science and Engineering at the University of Washington. Carlos was the cofounder and CEO of Turi (formerly Dato and GraphLab), a machine-learning company acquired by Apple. A world-recognized leader in the field of machine learning, Carlos was named one of Popular Science's Brilliant 10 in 2008. He received the 2009 IJCAI Computers and Thought Award for his contributions to artificial intelligence and a Presidential Early Career Award for Scientists and Engineers (PECASE).

Presentations

Trust your AI: High-precision explanations for the predictions of any machine-learning model Session

Carlos Guestrin offers an overview of anchors and aLIME, novel techniques for explaining the predictions of any classifier in an interpretable, faithful manner with high precision. He demonstrates the flexibility of these methods by explaining different models for text, image classification, and visual question answering and explores the usefulness of explanations via novel experiments.

Debraj GuhaThakurta is a senior data scientist in Microsoft’s Azure Machine Learning group, where he focuses on the use of different platforms and toolkits, such as Microsoft’s Cortana Analytics Suite, R Server, SQL Server, Hadoop, and Spark clusters, for creating scalable and operationalized analytical processes for various business problems. Debraj has extensive industry experience in the biopharma and financial forecasting domains. He holds a PhD in chemistry and biophysics and did postdoctoral research in machine-learning applications in genomics. Debraj has published more than 25 peer-reviewed papers, book chapters, and patents.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Sijie Guo is a cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for the messaging group at Twitter, where he cocreated Apache DistributedLog, and before that he worked on push notification infrastructure at Yahoo. Sijie is the PMC chair of Apache BookKeeper.

Presentations

Building reliable real-time services with Apache DistributedLog Session

Apache DistributedLog (incubating) is a low-latency, high-throughput replicated log service. Sijie Guo shares how Twitter has used DistributedLog as the real-time data foundation in production for years, supporting services like distributed databases, pub-sub messaging, and real-time stream computing and delivering more than 1.5 trillion events (17 PB) per day.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him.

Presentations

Getting started with TensorFlow Tutorial

Amy Unruh and Yufeng Guo walk you through training and deploying a machine-learning system using TensorFlow, a popular open source library. Amy and Yufeng begin by giving an overview of TensorFlow and demonstrating some fun, already-trained TensorFlow models.

Shekhar Gupta is a software engineer at Pepperdata. He holds a PhD from TU Delft, where he focused on using machine learning to improve and monitor the performance of distributed systems.

Presentations

Big data for big data: Machine-learning models of Hadoop cluster behavior Session

Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.

Joel Gurin is president and founder of the Center for Open Data Enterprise, a Washington-based nonprofit that works to maximize the value of open data as a public resource. Joel also serves as a senior open data consultant to the World Bank. Previously, he wrote the book Open Data Now and led the launch team for the GovLab’s Open Data 500 study and Open Data Roundtables. Joel served as chair of the White House Task Force on Smart Disclosure and as chief of the Consumer and Governmental Affairs Bureau of the US Federal Communications Commission. For over a decade, Joel was editorial director and executive vice president of Consumer Reports, where he directed the development of ConsumerReports.org, the world’s largest paid-subscription, information-based website. He can be reached at joel@odenterprise.org.

Presentations

The future of open data: Building businesses with a major national resource Session

Open government data—free public data that anyone can use and republish—is a major resource for entrepreneurs and innovators. The Center for Open Data Enterprise has partnered with the White House, government agencies, and businesses to show how this resource can create economic value. Joel Gurin and Katherine Garcia share case studies of how open data is being used and a vision for its future.

Alex Gutow is senior product marketing manager at Cloudera, focused on the analytic database platform solution and technologies. Prior to Cloudera, she managed technical marketing and PR for Basho Technologies and managed consumer and enterprise marketing for Truaxis, a MasterCard company. Alex holds a BS in marketing and BA in psychology from Carnegie Mellon University.

Presentations

BI and SQL analytics with Hadoop in the cloud Session

Henry Robinson and Alex Gutow explain how best to take advantage of the flexibility and cost-effectiveness of the cloud for your BI and SQL analytic workloads, using Apache Hadoop and Apache Impala (incubating) to provide the same functionality, partner ecosystem, and flexibility as on-premises deployments.

Ryan Hafen is an independent statistical consultant and an adjunct assistant professor in the Statistics Department at Purdue University. Ryan’s research focuses on methodology, tools, and applications in exploratory analysis, statistical model building, and machine learning on large, complex datasets. He is the developer of the datadr and Trelliscope components of the Tessera project (now DeltaRho) as well as the rbokeh visualization package. Ryan’s applied work on analyzing large, complex data has spanned many domains, including power systems engineering, nuclear forensics, high-energy physics, biology, and cybersecurity. Ryan holds a BS in statistics from Utah State University, an MStat in mathematics from the University of Utah, and a PhD in statistics from Purdue University.

Presentations

Exploration and visualization of large, complex datasets with R, Hadoop, and Spark Tutorial

Divide and recombine techniques provide scalable methods for exploration and visualization of otherwise intractable datasets. Stephen Elston and Ryan Hafen lead a series of hands-on exercises to help you develop skills in exploration and visualization of large, complex datasets using R, Hadoop, and Spark.

Matar Haller is a data scientist at Winton Capital. Previously, Matar was a neuroscientist at UC Berkeley, where she recorded and analyzed signals from electrodes surgically implanted in human brains.

Presentations

Automatic speaker segmentation: Using machine learning to identify who is speaking when Session

With the exploding growth of video and audio content online, there's an increasing need for indexable and searchable audio. Matar Haller demonstrates how to automatically identify who is speaking when in a recorded conversation using machine learning applied to a corpus of audio recordings. Matar shares how she approached the problem, the algorithms used, and steps taken to validate the results.
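
As a rough illustration of one ingredient of such a pipeline (a toy sketch, not Matar's actual method; the filename and two-speaker assumption are hypothetical), one can extract MFCC features from a recording and cluster the frames by speaker:

```python
import librosa                      # audio feature extraction
from sklearn.cluster import KMeans

y, sr = librosa.load("meeting.wav", sr=16000)            # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T     # frames x coefficients

# Naively cluster frames into two speakers; real diarization adds voice
# activity detection, temporal smoothing, and an unknown speaker count.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(mfcc)

hop_seconds = 512 / sr              # librosa's default hop length is 512 samples
for i in range(0, len(labels), 50): # print a coarse speaker timeline
    print(f"{i * hop_seconds:6.2f}s -> speaker {labels[i]}")
```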

Bryan Harrison is vice president of credit and operational risk business intelligence at American Express. With more than 15 years of business and IT experience in risk, analytics, and business intelligence, leading complex global projects that span both business and IT organizations, Bryan has helped organizations manage strategic change with a combination of process improvement and innovation. Having held roles in both IT and analytics, Bryan understands the business and technical sides of BI and big data, enabling him to bridge the gap between people, processes, and technology.

Presentations

How American Express scaled BI on Hadoop for interactive, billion-row, 2,000+-user queries Tutorial

American Express processes 24% of global credit card transactions, making data, risk, and security top priorities. Bryan Harrison highlights the modern process, people, and architecture approach that has enabled Amex to scale BI on Hadoop, providing instant access to real-time, granular data as well as broad historical views for modeling, so Amex can stay ahead of fraud in the future.

Jim Harrold is NationBuilder's data services engineer, which puts him at the intersection of big data, public service, and politics. Previously, Jim worked at Project VoteSmart and the University of Nebraska Medical Center, where he conducted political research, collected health data, and researched members of Congress. Jim holds an undergraduate degree in political science from the University of Nebraska-Lincoln and a master's degree in international relations and affairs.

Presentations

Wrangling the vote: Fueling campaign strategies analyzing diverse voter data Tutorial

Modern political campaigns at the local, state, and national level cannot be won without working with voter data. This has given birth to a wave of technology that has both influenced and insinuated itself into the fabric of modern politics. Peeking behind the curtain, Jim Harrold explores how this system utilizes data to help campaign strategists win elections.

Frances Haugen is a data product manager at Pinterest focusing on ranking content in the Home Feed and Related Pins and the challenges of driving immediate user engagement without harming the long-term health of the Pinterest content ecosystem. Previously, Frances worked at Google, where she founded the Google+ Search team and built the first non-“Quality”-based search experience at Google. (It was time based with light spam filtering.) She also cofounded the Google Boston Search team. Frances loves user-facing big data applications and finding ways to make mountains of information useful and delightful to the user. She was a member of the founding class of Olin College and holds a master’s degree from Harvard.

Presentations

When is data science a house of cards? Replicating data science conclusions Session

An experiment at Pinterest revealed somewhat shocking results. When nine data scientists and ML engineers were asked the same constrained question, they gave nine spectacularly different answers. The implications for business are astronomical. June Andrews and Frances Haugen explore the aspects of analysis that cause differences in conclusions and offer some solutions.

Joseph M. Hellerstein is the Jim Gray Chair of Computer Science at UC Berkeley and cofounder and CSO at Trifacta. Joe’s work focuses on data-centric systems and the way they drive computing. He is an ACM fellow, an Alfred P. Sloan fellow, and the recipient of three ACM-SIGMOD Test of Time awards for his research. He has been listed by Fortune among the 50 smartest people in technology, and MIT Technology Review included his work on their TR10 list of the 10 technologies most likely to change our world.

Presentations

Beyond polarization: Data UX for a diversity of workers Session

Joe Hellerstein, Giorgio Caviglia, and Alon Bartur share their design philosophy for users and their experience designing UIs, illustrating their design principles with core elements from Trifacta, including the founding technology of predictive interaction, recent innovations like transform builder, and other developments in their core transformation experience.

Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic-net regularization in Spark’s ML library and one-pass elastic-net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine-learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.

Presentations

Spark Structured Streaming for machine learning Session

Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.
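
For context, Structured Streaming treats a live stream as an unbounded DataFrame, so the ML work described in the session sits on top of code like the following minimal word count (a sketch assuming a local socket source on port 9999):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Read a socket stream as an unbounded DataFrame with a single "value" column.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()   # continuously updated aggregation

# Re-emit the full result table to the console as new data arrives.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```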

Ricky Hennessy is a data scientist at frog’s Austin studio, where he works on multidisciplinary teams to design and prototype data-driven products and help clients craft intelligent data strategies. Ricky has worked with clients in the defense, financial, insurance, professional sports, and retail industries. In addition to his work with frog, Ricky also works as a data science instructor at General Assembly. Ricky holds a PhD in biomedical engineering from UT Austin, where he gained expertise in scientific research, machine learning, algorithm development, and data analysis.

Presentations

Bringing data into design: How to craft personalized user experiences Session

From personalized newsfeeds to curated playlists, users want tailored experiences when they interact with their devices. Ricky Hennessy and Charlie Burgoyne explain how frog’s interdisciplinary teams of designers, technologists, and data scientists create data-driven, personalized, and adaptive user experiences.

Craig Hibbeler is a principal for big data and security within MasterCard Advisors' Enterprise Information Management consulting practice, where he leverages practical hands-on experience and broad industry and platform knowledge to develop, execute, secure, and drive results with customers' big data platforms and initiatives.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

Leonard Hinds is a big data technologist with 25+ years' experience in both hardware and software. He is a senior manager in Accenture's Big Data practice, where he supports clients in CMT (communications, media, and technology), banking, and retail worldwide. Leonard holds a master's degree in electrical engineering from Cornell University and a bachelor's degree in electrical engineering from Worcester Polytechnic Institute.

Presentations

Executive Briefing: Analytics and the new intelligent agenda Session

Join John Matchette and Leonard Hinds as they offer insights into how leading enterprises are unlocking new economic possibilities by embedding intelligence into the core of their business and explore five key actions businesses are taking today to realize the promise of big data and analytics.

Bob Horton is a senior data scientist in the Microsoft Partner ecosystem. Bob came to Microsoft from Revolution Analytics, where he was on the Professional Services team. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento. Bob currently holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Juliet Hougland is a data scientist at Cloudera and contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil and gas pipelines at Deep Signal and designing and building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in applied mathematics from the University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in math-physics.

Presentations

Guerrilla guide to Python and Apache Hadoop Tutorial

Using an interactive demo format with accompanying online materials and data, data scientist Juliet Hougland offers a practical overview of the basics of using Python data tools with a Hadoop cluster.

Nischal HP is a cofounder of and data scientist at Unnati Data Labs, where he is building end-to-end data science systems in the fields of fintech, marketing analytics, and event management. Nischal is also a mentor for data science on Springboard. Previously, during his tenure at Redmart, he built various ecommerce systems from scratch for catalog management, recommendation engines, and sentiment analysis, and at SAP Labs, he built data crawlers and intention-mining systems and laid down the initial work on an end-to-end text mining and analysis pipeline, though the majority of his work there centered on gamifying technical indicators for algorithmic trading platforms. Nischal has conducted workshops in the field of deep learning across the world and has spoken at a number of data science conferences. He is a strong believer in open source and loves to architect big, fast, and reliable systems. In his free time, he enjoys music, traveling, and meeting new people.

Presentations

Making architecture choices for small and big data problems Session

Not all data science problems are big data problems. Lots of small and medium product companies want to start their journey to become data driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed and fault-tolerant tools.

Karen Hsu is head of growth at BlockCypher. Karen has over 20 years of experience in technology, with a focus on business intelligence, fintech, and the blockchain, and has worked in a variety of engineering, marketing, and sales roles to bring new products to market. She has coauthored four patents. Karen holds a BS in management science and engineering from Stanford University.

Presentations

Spark, GraphX, and blockchains: Building a behavioral analytics platform for forensics, fraud, and finance Session

Bryan Cheng and Karen Hsu describe how they built machine-learning and graph traversal systems on Apache Spark to help government organizations and private businesses stay informed in the brave new world of blockchain technology. Bryan and Karen also share lessons learned combining these two bleeding-edge technologies and explain how these techniques can be applied to private and federated chains.

Luhui Hu is the chief architect and CTO for Huawei’s big data cloud. Previously, Luhui spent 10 years at Amazon and Microsoft focusing on cloud computing, big data, AI, and ecommerce. He is an entrepreneurial technology leader with a strong record of innovation and delivery; he has led and launched multiple big data and ML cloud services over the years.

Presentations

Modern big data service architecture: Evolving from cloud-native and serverless to intelligent data clouds (sponsored by Futurewei Technologies) Session

With Huawei's big data cloud ecosystem, you can define and set up your data pipelines quickly and easily, whether you're looking for batch processing or stream analytics. Luhui Hu shares best practices for designing a big data pipeline in the cloud and explains how to implement serverless big data solutions and intelligent data clouds.

Yin Huai is a software engineer at Databricks focusing on making Spark easy to use and manage. He is an Apache Spark committer and PMC member and an Apache Hive committer. Before joining Databricks, he was a PhD student at the Ohio State University, where he was advised by Xiaodong Zhang. His interests include storage systems, database systems, and query optimization.

Presentations

How Spark can fail or be confusing and what you can do about it Session

Just like any six-year-old, Apache Spark does not always do its job and can be hard to understand. Yin Huai looks at the top causes of job failures customers encountered in production and examines ways to mitigate such problems by modifying Spark. He also shares a methodology for improving resilience: a combination of monitoring and debugging techniques for users.

Grace Huang is the data science lead for discovery at Pinterest, where discovery products like recommendations and personalization are developed. She is passionate about building data science products around machine-learning algorithms to drive a better experience for Pinterest users and build a sustainable ecosystem.

Presentations

Building a sustainable content ecosystem at Pinterest Session

With over 75 billion pins, the Pinterest content corpus is one of the largest human-curated collections of ideas. Grace Huang walks you through the lifecycle of a piece of content in Pinterest, a portfolio of metrics developed to monitor the health of the content corpus, and the story of creating a cross-functional initiative to preserve a healthy, sustainable content ecosystem.

Tim Hunter is a software engineer at Databricks and contributes to the Apache Spark MLlib project. Tim holds a PhD from UC Berkeley, where he built distributed machine-learning systems starting with Spark version 0.2.

Presentations

Best practices for deep learning on Apache Spark Session

Joseph Bradley and Tim Hunter share best practices for building deep learning pipelines with Apache Spark, covering cluster setup, data ingest, tuning clusters, and monitoring jobs—all demonstrated using Google’s TensorFlow library.

Alysa Z. Hutnik is a partner in the Advertising & Marketing and Privacy & Information Security practices at Kelley Drye & Warren LLP in Washington, DC. Her practice represents clients in all forms of consumer-protection matters, from counseling to defending regulatory investigations and litigation. Alysa’s specific focus is on privacy, data security, and advertising law, including unfair and deceptive practices, electronic and mobile commerce, and data sharing. Alysa is past chair of the ABA’s Privacy and Information Security Committee (Section of Antitrust), the cochair of the section’s 2011 Consumer Protection Conference, and the editor-in-chief of the ABA’s Data Security Handbook, a practical guide for data-security legal practitioners. To find out more about Alysa and Kelley Drye & Warren LLP, visit KelleyDrye.com, subscribe to the AdLawAccess.com blog, or find Kelley Drye on Facebook.

Presentations

Executive Briefing: Doing data right—Legal best practices for making your data work Session

Big data promises enormous benefits for companies, and new innovations in this space only mean more data collection is required. Having a solid understanding of legal obligations will help you avoid the legal snafus that can come with collecting big data. Alysa Hutnik and Crystal Skelton outline legal best practices and practical tips to avoid becoming a big data “don’t.”

Mario Inchiosa’s passion for data science and high-performance computing drives his work at Microsoft, where he focuses on delivering parallelized, scalable advanced analytics integrated with the R language. Previously, Mario served as Revolution Analytics’s chief scientist and as analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R. Prior to that, Mario was US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances. He also served as US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining, and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning Publication of the Year and Open Literature Publication Excellence awards.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Anand Iyer is a senior product manager at Cloudera, the leading vendor of open source Apache Hadoop. His primary areas of focus are platforms for real-time streaming, Apache Spark, and tools for data ingestion into the Hadoop platform. Before joining Cloudera, Anand worked as an engineer at LinkedIn, where he applied machine-learning techniques to improve the relevance and personalization of LinkedIn’s Feed. Anand has extensive experience leveraging big data platforms to deliver products that delight customers. He holds a master’s in computer science from Stanford and a bachelor’s from the University of Arizona.

Presentations

Practical considerations for running Spark workloads in the cloud Session

Both Spark workloads and use of the public cloud have been rapidly gaining adoption in mainstream enterprises. Anand Iyer and Eugene Fratkin discuss new developments in Spark and provide an in-depth discussion on the intersection between the latest Spark and cloud technologies.

Romit Jadhwani is business intelligence lead at Pinterest, where he is helping to improve business operations by enabling data analytics, data science, and visualization. Romit has over 10 years of experience in business intelligence and analytics across technology, online advertising, telecommunications, and financial services industries. He loves solving challenges at scale and is passionate about unlocking the maximum business potential of technology assets. Previously, Romit led a BI team at Google focused on financial analytics for Google’s advertising products. He holds a graduate degree in computer science.

Presentations

How Pinterest scaled to build the world’s catalog of 75+ billion ideas Session

Over the course of just six years, Pinterest has helped over 100 million pinners discover and collect more than 75 billion ideas to plan their everyday lives. Romit Jadhwani walks you through the different phases of this hypergrowth journey and explores the focuses, thought processes, and decisions of Pinterest's data team as they scaled and enabled this growth.

Prakhar Jain is a member of the technical staff at Qubole, where he works on the cluster orchestration stack. Prakhar holds a bachelor of computer science engineering from the Indian Institute of Technology, Bombay, India.

Presentations

Moving big data as a service to a multicloud world Session

Qubole started out by offering Hadoop as a service in AWS. Over time, it extended its big data capabilities beyond Hadoop and its cloud infrastructure support beyond AWS. Sriram Ganesan and Prakhar Jain explain how and why Qubole built Cloudman, a simple, cloud-agnostic, multipurpose provisioning tool that can be extended for further engines and further cloud support.

Nandu Jayakumar is a software architect and engineering leader at Visa, where he is currently responsible for the long-term architecture of data systems and leads the data platform development organization. Previously, as a senior leader of Yahoo’s well-regarded data team, Nandu built key pieces of Yahoo’s data processing tools and platforms over several iterations, which were used to improve user engagement on Yahoo websites and mobile apps. He also designed large-scale advertising systems and contributed code to Shark (SQL on Spark) during his time there. Nandu holds a bachelor’s degree in electronics engineering from Bangalore University and a master’s degree in computer science from Stanford University, where he focused on databases and distributed systems.

Presentations

Swipe, dip, and hover: Managing card payment data at Visa Session

Visa is transforming the way it manages data: database appliances are giving way to Hadoop and HBase, and proprietary ETL is being replaced by Spark. Nandu Jayakumar and Rajesh Bhargava discuss the adoption of big data practices at this conservative financial enterprise and contrast it with the adoption of the same ideas at Nandu's previous employer, a web/ad-tech company.

Calvin Jia is the top contributor to the Alluxio project and one of the earliest contributors. He started on the project as an undergraduate working in UC Berkeley’s AMPLab. He is currently a software engineer at Alluxio. Calvin has a BS from the University of California, Berkeley.

Presentations

Effective Spark with Alluxio Session

Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. Gene Pang and Jiri Simsa introduce Alluxio, explain how Alluxio can help Spark be more effective, show benchmark results with Spark RDDs and DataFrames, and describe production deployments with both Alluxio and Spark working together.

Chandan Joarder is the principal engineer at Macy’s, where his team is responsible for data services integration for Macys.com. The team has made significant strides in incorporating real-time analytics into the company’s endeavors.

Presentations

Building real-time dashboards with Kafka, web frameworks, and an in-memory database Session

Chandan Joarder shares a guide to building real-time dashboards in-house using tools such as Kafka, web frameworks, and an in-memory database, utilizing JavaScript and Scala. Along the way, Chandan also discusses the architectural principles used in these dashboards to provide up-to-the-hour business performance metrics and alerts.

Michael I. Jordan is the Pehong Chen Distinguished Professor in the Department of Electrical Engineering and Computer Science and the Department of Statistics at the University of California, Berkeley. His research interests bridge the computational, statistical, cognitive, and biological sciences; in recent years, he has focused on Bayesian nonparametric analysis, probabilistic graphical models, spectral methods, kernel machines, and applications to problems in distributed computing systems, natural language processing, signal processing, and statistical genetics. Previously, he was a professor at MIT. Michael is a member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences and a fellow of the American Association for the Advancement of Science, the AAAI, ACM, ASA, CSS, IEEE, IMS, ISBA, and SIAM. He has been named a Neyman Lecturer and a Medallion Lecturer by the Institute of Mathematical Statistics. He received the David E. Rumelhart Prize in 2015 and the ACM/AAAI Allen Newell Award in 2009. Michael holds a master’s degree in mathematics from Arizona State University and a PhD in cognitive science from the University of California, San Diego.

Presentations

Dirk Jungnickel is a senior vice president heading the central Business Analytics and Big Data function of Emirates Integrated Telecommunications Company (du), which integrates data warehousing, big data platforms, BI tools, data governance, business intelligence, and advanced analytics capabilities. Following an academic career in theoretical physics, with more than seven years of postdoctoral research, he has spent 17 years in telecommunications. A seasoned telecommunications executive, Dirk has held a variety of roles in international firms, including senior IT and IT architecture roles, various program management and business intelligence positions, the head of corporate PMO, and an associate partner with a global management and strategy consulting firm.

Presentations

Data monetization: A telecommunications use case Tutorial

Dirk Jungnickel explains how Dubai-based telco leader du leverages big data to create smart cities and enable location-based data monetization, covering business objectives and outcomes and addressing technical and analytical challenges.

Russell Jurney is principal consultant at Data Syndrome, a product analytics consultancy dedicated to advancing the adoption of the development methodology Agile Data Science, as outlined in the book Agile Data Science 2.0 (O'Reilly, 2017). He has worked as a data scientist building data products for over a decade, starting in interactive web visualization and then moving toward full-stack data products, machine learning, and artificial intelligence at companies such as Ning, LinkedIn, Hortonworks, and Relato. He is a self-taught visualization software engineer, data engineer, data scientist, and writer, and most recently, he has become a teacher. In addition to helping companies build analytics products, Data Syndrome offers live and video training courses.

Presentations

Office Hour with Russell Jurney (Data Syndrome) Office Hour

Join Russell to discuss the analytics methodology outlined in his book Agile Data Science 2.0 and the creation, deployment, and iterative improvement of a real-time predictive system using Python, Spark MLlib, Spark Streaming, Kafka, MongoDB, and JQuery.

David Kale is a deep learning engineer at Skymind and a PhD candidate in computer science at the University of Southern California (advised by Greg Ver Steeg of the USC Information Sciences Institute). David’s research uses machine learning to extract insights from digital data in high-impact domains, such as healthcare. Recently, he has pioneered the application of recurrent neural nets to modern electronic health records data. At Skymind, he is developing the ScalNet Scala API for DL4J and working on model interoperability between DL4J and other major frameworks. David organizes the Machine Learning and Healthcare Conference (MLHC), is a cofounder of Podimetrics, and serves as a judge in the Qualcomm Tricorder XPRIZE competition. David is supported by the Alfred E. Mann Innovation in Engineering Fellowship.

Presentations

Scalable deep learning for the enterprise with DL4J Tutorial

Dave Kale, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Intelligent pattern profiling on semistructured data with machine learning Session

It's well known that data analysts spend 80% of their time preparing data and only 20% analyzing it. In order to change that ratio, organizations must build tools specifically designed for working with ad hoc (semistructured) data. Sean Kandel and Karthik Sethuraman explore a new technique leveraging machine learning to discover and profile the inherent structure in ad hoc datasets.

Why the next wave of data lineage is driven by automation, visualization, and interaction Session

Sean Kandel and Wei Zheng offer an overview of an entirely new approach to visualizing metadata and data lineage, demonstrating automated methods for detecting, visualizing, and interacting with potential anomalies in reporting pipelines. Join in to learn what’s required to efficiently apply these techniques to large-scale data.

Holden Karau is a software development engineer at IBM and is active in open source. Previously, she worked on a variety of big data, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. Holden is the author of Learning Spark and has assisted with Spark workshops. She holds a bachelor of mathematics in computer science from the University of Waterloo.

Presentations

Debugging Apache Spark Session

Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.

Spark Structured Streaming for machine learning Session

Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.

Sunil Karkera is the CTO of the Digital Enterprise unit and the head of the Digital Reimagination Studio at Tata Consultancy Services, where he and his team of creative designers, engineers, and business strategists apply design thinking methodologies to fundamentally rethink business models, user experiences, and enabling technologies. Sunil has over 20 years' experience in technology and design and has founded three startups in Silicon Valley, all of which were successfully acquired. He is a trained engineer and a typographer. Sunil was part of the early enterprise data wave, creating eBusiness Anywhere, which was acquired by Siebel Systems in 1999. He was also part of the successful IPO of Sonicwall, which disrupted internet security technologies; there, he was on the engineering team that built network security technologies, including SSL acceleration, content filtering, high-performance packet filters, and a high-throughput networking operating system, SonicOS. Sunil was the vice president of business systems at Fox Interactive Media, responsible for engineering MySpace, American Idol, Twentieth Century Fox Studios, IGN, Rotten Tomatoes, and GameSpy. In 2007, he cofounded Registria to innovate in product registration and after-sales service and marketing. During this time, he architected the ecommerce and IoT backends for Nest and created the type design used in the Nest Thermostat user interface. In 2014, he cofounded Nurture software, which focused on creating mobile apps backed by advanced machine-learning technologies in the service of better health and wellness for women and children. Sunil has been granted US patents in areas including caching, product registration, configuration management, and dynamic email processing. He holds a bachelor's degree in computer science from Mangalore University in India. He has attended advanced typography design and color design programs at the University of Zurich (Zürcher Hochschule der Künste) under Rudolf Barmettler, as well as modern arts programs at the Stedelijk Museum, Amsterdam, covering pointillism and graffiti. Sunil is passionate about computer history and volunteers at the Computer History Museum in Mountain View, CA.

Presentations

Executive Briefing: Artificial intelligence Session

Satya Ramaswamy and Sunil Karkera offer an overview of the recent technical advances that have made the current AI revolution possible, convincingly answering the "why now?" question.

Aneesh Karve is co-founder and CTO at Quilt, a data virtualization platform for data scientists. Previously, Aneesh worked as a product manager, lead designer, and software engineer at companies like Microsoft, NVIDIA, and Matterport. Aneesh was the general manager for AdJitsu, the first real-time 3D advertising platform for iOS, acquired in 2012. Aneesh’s research background spans proteomics, machine learning, and algebraic number theory. He holds degrees in chemistry, mathematics, and computer science.

Presentations

Visualization without guesswork Tutorial

Seemingly harmless choices in visualization design and content selection can distort your data and lead to false conclusions. Aneesh Karve presents a quantitative framework for identifying and overcoming distortions by applying recent research in algebraic visualization.

Andra Keay is the managing director of Silicon Valley Robotics, an industry group supporting innovation and commercialization of robotics technologies. Andra is also the founder of Robot Launch, a global robotics startup competition; cofounder of the Robot Garden hackerspace; a mentor at hardware accelerators; a startup advisor; and an active angel investor in robotics startups. She is also a director at Robohub.org, the global site for news and views on robotics. Previously, Andra was an ABC film, television, and radio technician and taught interaction design at the University of Technology, Sydney. Andra has keynoted at major international conferences, including USI 2016, WebSummit 2014 and 2015, Collision 2015 and 2016, Pioneers Festival 2014, JavaOne 2014, Solid 2014, and SxSW 2015. She was selected as an HRI Pioneer in 2010. Andra holds a BA in communication from the University of Technology, Sydney, Australia, and an MA in human-robot culture from the University of Sydney, Australia, where her work built on her background as a robot geek, STEM educator, and filmmaker.

Presentations

Making good robots Keynote

Let’s stop talking about bad robots and start talking about what makes a robot good. A good or ethical robot must be carefully designed. Andra Keay outlines five principles of good robot design and discusses the implications of implicit bias in our robots.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install and click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open sourced techniques for anomaly detection and breakout detection. His prior research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Anomaly detection in real-time data streams using Heron Session

Anomaly detection plays a key role in the analysis of real-time streams, as exemplified by detecting real-life incidents from tweet storms. Arun Kejariwal and Karthik Ramasamy walk you through how anomaly detection is supported in real-time data streams in Heron, the streaming system built in-house at Twitter (and open sourced) for real-time computation.
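
The session covers Heron's built-in support; as a generic illustration of the underlying idea (not Heron's actual algorithm), a rolling z-score detector flags points that deviate sharply from recent history:

```python
import math
import random
from collections import deque

def zscore_anomalies(stream, window=100, threshold=3.0):
    """Yield points more than `threshold` standard deviations from the
    mean of the preceding `window` observations."""
    recent = deque(maxlen=window)
    for x in stream:
        if len(recent) >= 10:  # require a minimal history first
            mean = sum(recent) / len(recent)
            std = math.sqrt(sum((v - mean) ** 2 for v in recent) / len(recent))
            if std > 0 and abs(x - mean) / std > threshold:
                yield x
        recent.append(x)

# Example: a sudden level shift in otherwise steady data is flagged.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(500)] + [12.0]
print(list(zscore_anomalies(data)))  # -> [12.0] (plus any chance outliers)
```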

Phil Keslin is the founder and chief technology officer of Niantic, Inc., creator of the popular game Pokémon GO, which he started in early 2011 and incubated within Google. Phil led the engineering team in exploring the convergence of mobile, geo, and social within a range of applications leading up to the launch of Ingress and FieldTrip. Previously, Phil was a contributor to the StreetView, Gmail, and Lively products at Google and a GPU architect at NVIDIA, where he was a key contributor to the design and development of several of their GPUs. In 2000, Phil joined up with John Hanke to found Keyhole. As its CTO, Phil led the development of the Earthviewer application, which would later become Google Earth following the company's acquisition by Google. Phil holds an MBA from Southern Methodist University and a bachelor's degree in computer science from the University of Texas at Austin.

Presentations

Launching Pokémon GO Keynote

Pokémon GO was one of the fastest-growing games of all time, becoming a worldwide phenomenon in a matter of days. In conversation with Beau Cronin, Phil Keslin, CTO of Niantic, explains how the engineering team prepared for—and just barely survived—the experience.

Dale Kim is the senior director of industry solutions at MapR. His background includes a variety of technical and management roles at information technology companies. While Dale's experience includes work with relational databases, much of his career pertains to nonrelational data in the areas of search, content management, and NoSQL and includes senior roles in technical marketing, sales engineering, and support engineering. Dale holds an MBA from Santa Clara University and a BA in computer science from the University of California, Berkeley.

Presentations

Architectural considerations for building big data applications in the cloud Session

Big data applications in the cloud are becoming more about the global distribution and access of data than about easier deployments. Dale Kim shares insights on architecting big data applications for the cloud, using an example reference application his team built and published as context for describing several key requirements for cloud-based environments.

Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.

Presentations

Ask me anything: Apache Beam AMA

Join Tyler Akidau, Frances Perry, Kenneth Knowles, and Slava Chernyak to discuss anything related to Apache Beam.

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating) Session

Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.
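
To make that balance concrete, here is a minimal Beam Python sketch (the input path and comma-separated key format are hypothetical): a fixed event-time window whose trigger emits early speculative counts every ten seconds of processing time and a final count when the watermark passes.

```python
import apache_beam as beam
from apache_beam.transforms import window, trigger

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText("gs://my-bucket/events*.txt")    # hypothetical input
     | beam.Map(lambda line: (line.split(",")[0], 1))        # (key, 1) pairs
     | beam.WindowInto(
           window.FixedWindows(60),                          # 1-minute windows
           trigger=trigger.AfterWatermark(
               early=trigger.AfterProcessingTime(10)),       # speculative firings
           accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
     | beam.CombinePerKey(sum)                               # count per key
     | beam.Map(print))
```

Trading earlier, more frequent firings against the cost of recomputation is exactly the completeness/latency/cost balance the model exposes, and the same pipeline is portable across the runners named above.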

Mike Koelemay runs the Data Science team within Advanced Analytics at Sikorsky, where he is responsible for bringing state-of-the-art analytics and algorithm technologies to support the ingestion, processing, and serving of data collected onboard thousands of aerospace assets around the world. Drawing on his 10+ years of experience in applied data analytics for integrated system health management technologies, Mike works with other software engineers, data architects, and data scientists to support the execution of advanced algorithms, data mining, signal processing, system optimization, and advanced diagnostics and prognostics technologies, with a focus on rapidly generating information from large, complex datasets.

Presentations

Where data science meets rocket science: Data platforms and predictive analytics for aerospace Session

Sikorsky collects data onboard thousands of helicopters deployed worldwide that is used for fleet management services, engineering analyses, and business intelligence. Mike Koelemay offers an overview of the data platform that Sikorsky has built to manage the ingestion, processing, and serving of this data so that it can be used to rapidly generate information to drive decision making.

Daphne Koller is cochair of the board and cofounder of Coursera and the chief computing officer at Calico Labs, an Alphabet (Google) company that is using advanced technology to understand aging and design interventions that help people lead longer, healthier lives. Previously, she was the Rajeev Motwani Professor of Computer Science at Stanford University, where she served on the faculty for 18 years. Daphne is the author of over 200 refereed publications appearing in venues such as Science, Cell, and Nature Genetics. Daphne was recognized as one of Time magazine’s 100 most influential people in 2012 and one of Newsweek’s 10 most important people in 2010. She has been honored with multiple awards and fellowships during her career including the Sloan Foundation Faculty Fellowship in 1996, the ONR Young Investigator Award in 1998, the Presidential Early Career Award for Scientists and Engineers (PECASE) in 1999, the IJCAI Computers and Thought Award in 2001, the MacArthur Foundation Fellowship in 2004, and the ACM/Infosys Award in 2008. Daphne was inducted into the National Academy of Engineering in 2011 and elected a fellow of the American Academy of Arts and Sciences in 2014. Her teaching was recognized via the Cox Medal for excellence in fostering undergraduate research at Stanford in 2003, and by being named a Bass University Fellow in Undergraduate Education.

Presentations

Applying data and machine learning to scale education Keynote

Daphne Koller explains how Coursera is using large-scale data processing and machine learning in online education. Building on Coursera's wealth of online learning data, Daphne discusses the role of automation in scaling access to education that is personalized and efficient at connecting people with skills and knowledge throughout their lives.

Shyam Konda is a lead data engineer at Beachbody. Shyam has over 10 years of experience in the IT industry, with a focus on data warehouse and BI platforms. Shyam started his career as an ETL engineer, working on tools like Informatica, SSIS, Talend, and SAP Data Services. He’s well versed in writing SQL and PL/SQL code against relational databases like Oracle, SQL Server, Teradata, MySQL, Hive, Redshift, Postgres, and Netezza and has worked on cutting-edge technologies like Hadoop, Sqoop, Hive, Pig, HBase, Elasticsearch, S3, and Amazon Redshift. Shyam has helped teams at previous companies build jobs that can be used as templates for code reusability, making maintenance and enhancements easy. He holds a master’s degree in computer science from Northwest Missouri State University.

Presentations

Building data lakes in the cloud with self-service access (sponsored by Talend) Session

Eric Anderson and Shyam Konda explain how the IT team at Beachbody—the makers of P90X and CIZE—successfully ingested all their enterprise data into Amazon S3 and delivered self-service access in less than six months with Talend.

Andy Konwinski is a founder and VP at Databricks. He has been working on Spark since the early days of the project, starting during his PhD in the UC Berkeley AMPLab, and has contributed as a software engineer to Spark’s performance evaluation components, testing infrastructure, documentation, and more. He was also a creator of the Apache Mesos project, contributed to the Hadoop Job Scheduler, and led the creation of the UC Berkeley AMP Camps and the Spark Summits. Andy coauthored Learning Spark from O’Reilly.

Presentations

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML Tutorial

Andy Konwinski introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine-learning library, using text mining on real-world data as the primary end-to-end use case.

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Presentations

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker offer an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Anirudh Koul is a data scientist at Microsoft. Anirudh brings a decade of applied research experience on petabyte-scale social media datasets, including Facebook, Twitter, Yahoo Answers, Quora, Foursquare, and Bing. He has worked on a variety of machine-learning, natural language processing, and information retrieval-related projects at Yahoo, Microsoft, and Carnegie Mellon University. Adept at rapidly prototyping ideas, Anirudh has won over two dozen innovation, programming, and 24-hour hackathons organized by companies including Facebook, Google, Microsoft, IBM, and Yahoo. He was also the keynote speaker at the 2014 SMX conference in Munich, where he spoke about trends in applying machine learning on big data.

Presentations

Squeezing deep learning onto mobile phones Session

Over the last few years, convolutional neural networks (CNN) have risen in popularity, especially in computer vision. Anirudh Koul explains how to bring the power of deep learning to memory- and power-constrained devices like smartphones and drones.

Jay Kreps is the cofounder and CEO of Confluent, a company focused on Apache Kafka. Previously, Jay was one of the primary architects for LinkedIn, where he focused on data infrastructure and data-driven products. He was among the original authors of a number of open source projects in the scalable data systems space, including Voldemort (a key-value store), Azkaban, Kafka (a distributed messaging system), and Samza (a stream processing system).

Presentations

Office Hour with Jay Kreps (Confluent) Office Hour

Jay is available to discuss Apache Kafka's roadmap and use cases and answer any other questions about Apache Kafka, Confluent, or streaming platforms.

The rise of real time: Apache Kafka and the streaming revolution Session

The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the union for stream processing, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Jay Kreps explores the future of Apache Kafka and the stream processing ecosystem.

Coco Krumme heads the data team at Haven and is adjunct faculty in the UC Berkeley master’s in data science program.

Presentations

How the shipping industry can become more data driven Tutorial

Data is transforming global trade. Using examples from historical trade and their work at Haven, Renee DiResta and Coco Krumme explore three frictions in logistics and container shipping—price opacity, inefficient markets, and unstructured data—and identify the important ways in which data will change how we price and exchange goods worldwide.

Sujay Kulkarni is a senior big data engineer at Malwarebytes. Previously, he designed and implemented some pretty badass big data solutions on Hadoop at GoPro and spent seven years at Apple, where he held a number of roles, including developer, architect, team lead, and technical chaos monkey, churning out brilliant ideas purely by accident. In his free time, Sujay likes photography, traveling, hiking, and sometimes just being a couch potato.

Presentations

Building an automation-driven Lambda architecture (sponsored by BMC) Session

Darren Chinen, Sujay Kulkarni, and Manjunath Vasishta demonstrate how to use a Lambda architecture to provide real-time views into big data by combining batch and stream processing, leveraging BMC’s Control-M as a critical component of both batch processing and ecosystem management.

Srini Kumar is the vice president of product management and data science at LevaData, Inc. Previously, he was a director of data science in the Algorithms and Data Science group at Microsoft, where he worked with strategic customers in the areas of Cortana Analytics and Microsoft R Server; headed product management for the enterprise information management (EIM) product suite at SAP; originated and architected a product on HANA to analyze human genome variants, which led to a discovery relating diabetes to a person’s origin and resulted in two patent applications related to modeling genomic variants and one related to enterprise information management; and helped turn around and sell a startup in the area of on-demand supply chain management software. Srini holds a master’s degree in industrial engineering from the University of Wisconsin-Madison and a bachelor’s degree in mechanical engineering from the Indian Institute of Technology, Madras.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Sasi Kuppannagari is a senior manager in the Sports Analytics and AI group at Intel, where he leads the technology stack and end-to-end solutions to build secure, scalable, and reliable big data platforms. Previously, Sasi worked at IBM Global Business Services managing engagements in cognitive computing information systems and in Watson solutions. He has extensive consulting experience at IBM, HP, Knightsbridge, and Accenture, specializing in enterprise data initiatives with expertise in big data solutions, business intelligence, predictive analytics, master data management, and data governance. Sasi has worked on a wide range of complex analytics initiatives for clients in finance, insurance, retail, energy, and healthcare verticals. Sasi holds a bachelor’s degree in mechanical engineering from India, an MS in industrial engineering from Florida State University, and an MBA from Cornell University.

Presentations

Big data analytics accelerating innovation in sports (sponsored by Intel) Session

Sasi Kuppannagari explores the innovative sports analytics solutions Intel is creating, such as using computer vision and big data analytics for athlete performance optimization.

Scott Kurth is the vice president of advisory services at Silicon Valley Data Science, where he helps clients define and execute the strategies and data architectures that enable differentiated business growth. Building on 20 years of experience making emerging technologies relevant to enterprises, he has advised clients on the impact of technological change, typically working with CIOs, CTOs, and heads of business. Scott has helped clients drive global technology strategy, conduct prioritization of technology investments, shape alliance strategy based on technology, and build solutions for their businesses. Previously, Scott was director of the Data Insights R&D practice within Accenture Technology Labs, where he led a team focused on employing emerging technologies to discover the insight contained in data and bring that insight to bear on business processes, enabling new and better outcomes and even entirely new business models. He also led the creation of Technology Vision, Accenture’s annual analysis of emerging technology trends impacting the future of IT, where he was responsible for tracking emerging technologies, analyzing their transformational potential, and using them to influence technology strategy for both Accenture and its clients.

Presentations

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Julie Steele, Stephen O'Sullivan, and Scott Kurth field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for and the evolving role of the CDO. Even if you don’t have a specific question, join in to hear what others are asking.

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.

Dwai Lahiri is a senior solutions architect at Cloudera and a longtime IT infrastructure practitioner. He works with Cloudera’s hardware and private and public cloud partners, enabling them to run Cloudera’s EDH stack on their respective platforms.

Presentations

How to leverage your private cloud infrastructure to deploy Hadoop Session

Dwai Lahiri explains how to leverage private cloud infrastructure to successfully build Hadoop clusters and outlines dos, don'ts, and gotchas for running Hadoop on private clouds.

Brian Lange is a partner and data scientist at Datascope, where he leads design process exercises and works on algorithms, web interfaces, and visualizations. Brian has contributed to projects for P&G, Thomson Reuters, Motorola, and other well-known companies, and his work has been featured on Nathan Yau’s FlowingData. When he’s not nerding out about typography and machine-learning techniques, Brian enjoys science and comedy podcasts, brewing beer, and listening to weird music.

Presentations

The perfect conference: Using stochastic optimization to bring people together Session

The goal of RCSA's Scialog conferences is to foster collaboration between scientists with different specialties and approaches, and, working with Datascope, RCSA has been doing so in a quantitative way for the last six years. Brian Lange discusses how Datascope and RCSA arrived at the problem, the design choices made in the survey and optimization, and how the results were visualized.

Once upon a time, Bill Lattner was a civil engineer. Now, he is a data scientist on the R&D team at Civis Analytics, where he spends most of his time writing tools for other data scientists, primarily in Python but also in R and occasionally Go. Prior to joining Civis, Bill was at Dishable, working on recommender systems and predicting the dining habits of Chicagoans.

Presentations

The power of persuasion modeling Session

How do we know that an advertisement or promotion truly drives incremental revenue? Michelangelo D'Agostino and Bill Lattner share their experience developing machine-learning techniques for predicting treatment responsiveness from randomized controlled experiments and explore the use of these “persuasion” models at scale in politics, social good, and marketing.
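
One common baseline for persuasion modeling is the two-model ("T-learner") approach: fit separate response models on the treated and control arms of a randomized experiment and score the difference. The sketch below illustrates that general technique on synthetic data with scikit-learn; it is not the speakers’ actual implementation.

```python
# Two-model uplift sketch on synthetic randomized-experiment data (illustrative).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))           # covariates
treated = rng.integers(0, 2, size=n)  # random assignment
# Synthetic outcome: treatment helps more when feature 0 is high.
p = 1 / (1 + np.exp(-(0.3 * X[:, 0] + 0.5 * treated * (X[:, 0] > 0))))
y = rng.binomial(1, p)

m_t = GradientBoostingClassifier().fit(X[treated == 1], y[treated == 1])
m_c = GradientBoostingClassifier().fit(X[treated == 0], y[treated == 0])

# Predicted persuasion effect per person: P(y | treat) - P(y | control).
uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]
print("mean predicted uplift:", round(uplift.mean(), 4))
```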

Julien Le Dem is a cocreator of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig. Julien is an architect at Dremio and was previously the tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

Office Hour with Julien Le Dem (Dremio) Office Hour

Join Julien to discuss columnar data processing and the hardware trends it can take advantage of.

The future of column-oriented data processing with Arrow and Parquet Session

In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, such as RDMA, SSDs, and nonvolatile memory.
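
To make the shared representation concrete, here is a minimal round trip between Arrow’s in-memory columnar format and Parquet’s on-disk columnar format using the pyarrow bindings (the table contents and file name are invented):

```python
# Arrow <-> Parquet round trip with pyarrow (illustrative).
import pyarrow as pa
import pyarrow.parquet as pq

# An Arrow table is columnar in memory, so engines can share it without copies.
table = pa.table({'user_id': [1, 2, 3], 'clicks': [10, 0, 7]})

pq.write_table(table, 'events.parquet')      # columnar on disk (Parquet)
roundtrip = pq.read_table('events.parquet')  # back into Arrow, column by column
print(roundtrip.schema)
```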

Mike Lee Williams is director of research at Fast Forward Labs, an applied machine intelligence lab in New York City, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Fast Forward Labs’s clients understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Learning from incomplete, imperfect data with probabilistic programming Session

Real-world data is incomplete and imperfect. The right way to handle it is with Bayesian inference. Mike Lee Williams demonstrates how probabilistic programming languages hide the gory details of this elegant but potentially tricky approach, making a powerful statistical method easy and enabling rapid iteration and new kinds of data-driven products.
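
As a tiny illustration of the style (a sketch in PyMC3, one probabilistic programming language among several; the observed counts are made up), a few lines declare the model and a single call hides the inference machinery:

```python
# Bayesian inference on small, imperfect data in a few lines of PyMC3 (illustrative).
import pymc3 as pm

conversions, trials = 12, 200  # made-up observed data

with pm.Model():
    # Prior belief about the underlying rate.
    rate = pm.Beta('rate', alpha=1, beta=1)
    # Likelihood of the finite, imperfect observations.
    pm.Binomial('obs', n=trials, p=rate, observed=conversions)
    # The "gory details" (MCMC sampling) hidden behind one call.
    trace = pm.sample(1000, tune=1000)

print('posterior mean rate:', trace['rate'].mean())
```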

Office Hour with Mike Lee Williams (Fast Forward Labs) Office Hour

Come chat with Mike about machine intelligence research in an applied business context, as well as probabilistic programming and text summarization with topic models and recurrent neural networks.

Bob Lehmann is an architect on the Data Platform team at Monsanto, where he leads efforts to both modernize enterprise technology and transition to the cloud. Bob has held a number of positions in IT and engineering working with data ranging from high-volume sensor data to enterprise data (and everything in between). He holds a master’s degree in electrical engineering from Missouri University of Science and Technology.

Presentations

Stream me up, Scotty: Transitioning to the cloud using a streaming data platform Session

Gwen Shapira and Bob Lehmann share their experience and patterns building a cross-data-center streaming data platform for Monsanto. Learn how to facilitate your move to the cloud while "keeping the lights on" for legacy applications. In addition to integrating private and cloud data centers, you'll discover how to establish a solid foundation for a transition from batch to stream processing.

Jure Leskovec is chief scientist at Pinterest and associate professor of computer science at Stanford University. Jure’s research focuses on computation over massive data and has applications in computer science, social sciences, economics, marketing, and healthcare. This research has won several awards, including the Lagrange Prize, a Microsoft Research Faculty Fellowship, an Alfred P. Sloan Fellowship, and numerous best paper awards. Jure holds a bachelor’s degree in computer science from the University of Ljubljana, Slovenia, and a PhD in machine learning from Carnegie Mellon University and undertook postdoctoral training at Cornell University.

Presentations

Recommending 1+ billion items to 100+ million users in real time: Harnessing the structure of the user-to-object graph to extract ranking signals at scale Session

Pinterest built a flexible, graph-based system for making recommendations to users in real time. The system uses random walks on a user-and-object graph in order to make personalized recommendations to 100+ million Pinterest users out of a catalog of over a billion items. Jure Leskovec explains how Pinterest built its modern recommendation engine and the lessons learned along the way.
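
To make the idea concrete, here is a toy sketch of a random walk with restarts on a tiny user-and-object graph (pure Python; the graph and parameters are invented, and Pinterest’s production system is of course far more elaborate): items visited most often, excluding those the user already has, become the ranking signal.

```python
# Toy random walk with restart on a user/item bipartite graph (illustrative).
import random
from collections import Counter

graph = {  # invented adjacency: users on one side, pins on the other
    'user_1': ['pin_a', 'pin_b'],
    'user_2': ['pin_a', 'pin_c'],
    'user_3': ['pin_b', 'pin_c'],
    'pin_a': ['user_1', 'user_2'],
    'pin_b': ['user_1', 'user_3'],
    'pin_c': ['user_2', 'user_3'],
}

def recommend(start, steps=10_000, restart_p=0.3):
    visits, node = Counter(), start
    for _ in range(steps):
        node = start if random.random() < restart_p else random.choice(graph[node])
        if node.startswith('pin_') and node not in graph[start]:
            visits[node] += 1  # count visits to items the user hasn't saved yet
    return visits.most_common()

print(recommend('user_1'))  # most-visited unseen pins ~ recommendation ranking
```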

Haoyuan Li is founder and CEO of Alluxio (formerly Tachyon Nexus), a memory-speed virtual distributed storage system. Before founding the company, Haoyuan was working on his PhD at UC Berkeley’s AMPLab, where he cocreated Alluxio. He is also a founding committer of Apache Spark. Previously, he worked at Conviva and Google. Haoyuan holds an MS from Cornell University and a BS from Peking University.

Presentations

Alluxio (formerly Tachyon): The journey thus far and the road ahead Session

Alluxio (formerly Tachyon) is an open source memory-speed virtual distributed storage system. The project has experienced a tremendous improvement in performance and scalability and was extended with key new features. Haoyuan Li and Gene Pang explore Alluxio's goal of making its product accessible to an even wider set of users through a focus on security, new language bindings, and APIs.

Robin Li is a managing analytics engineer at Tapjoy, where he oversees the design, implementation, and maintenance of Tapjoy’s big data platforms for data science and analytics. Robin’s work involves architecture-level framework and platform building on top of big data technologies such as Hadoop, Spark, Vertica, and NoSQL. Previously, he worked in the financial service industry. Robin holds an MSc in computer science from Imperial College London.

Presentations

Building a real-time data science service for mobile advertising Tutorial

To ensure that users have the best application experience, Tapjoy has architected a data science service to handle ad-request optimization and personalization in real time. Robin Li shares the critical considerations for building such a Lambda architecture and details the methods Tapjoy used to evaluate and implement its real-time architecture.

Yang Li is cofounder and CTO of Kyligence as well as a cocreator and PMC member of Apache Kylin. As the tech lead and architect of Kylin, Yang focuses on big data analysis, parallel computation, data indexing, relational algebra, approximation algorithms, and other technologies. Previously, he was senior architect of eBay’s Analytic Data Infrastructure department; tech lead of IBM’s InfoSphere BigInsights, where he was responsible for the Hadoop open source platform and won an Outstanding Technical Achievement Award; and a vice president at Morgan Stanley, responsible for the global regulatory reporting platform.

Presentations

Apache Kylin 2.0: From classic OLAP to real-time data warehouse Session

Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.

Martin is a director within Deloitte’s Technology Consulting practice, where he leads the information delivery capability and the big data proposition across FSI.

Martin’s areas of expertise include data integration, programme delivery, architecture and design, reporting, and performance tuning applied to mission-critical and complex systems.

Martin covers the entire breadth of data management, from designing and implementing technical solutions to providing high-level strategic advice, turning around failing projects, and troubleshooting. He is an information delivery specialist with an excellent track record in delivering, designing, integrating, and de-risking data solutions at blue-chip financial institutions and insurance companies.

Presentations

Building the “future you” retirement planning service on a Hadoop data lake Tutorial

Chris Murphy explains how a major insurance company adopted Hadoop to leverage its data to underpin the new customer-centric ethos of the organization and how this has enabled a new approach to helping customers truly understand their financial portfolios, build a roadmap to meet their financial goals, identify opportunities, and help them secure their financial future.

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

Apache Kudu: 1.0 and beyond Session

Todd Lipcon offers a very brief refresher on the goals and feature set of the Kudu storage engine, covering the development that has taken place over the last year, including new features such as improved support for time series workloads, performance improvements, Spark integration, and highly available replicated masters.

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker offer an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Office Hour with Todd Lipcon (Cloudera) Office Hour

Join Todd to talk about what's new in Kudu and real-time data applications with Kudu and Impala. You can also get advice on schema design, compression, encodings, partitioning, and best practices, as well as whether or not Kudu is a good fit for your use case.

Paige Liu is a software developer at Microsoft. Paige has been involved in the development of a wide range of diverse applications and services, from web applications to large-scale multitier distributed systems to hyperscale search engine backends. While most of her experience is with Microsoft technology, Paige has also developed and released cross-platform solutions, including Java APM (application performance monitoring) and Linux/Unix monitoring systems. Recently, she has been focusing on cloud computing, specifically with the Microsoft Azure cloud, helping enterprises develop new applications in the cloud or move their existing workloads to the cloud.

Presentations

Running a Cloudera cluster in production on Azure Session

Paige Liu and John Zhuge explore the options and trade-offs to consider when building a Cloudera cluster on Microsoft Azure Cloud and explain how to deploy and scale a Cloudera cluster on Azure and how to connect a Cloudera cluster with other Azure services to build enterprise-grade end-to-end big data solutions.

Victoria Livschitz is a founder and CTO of Grid Dynamics, a leading engineering IT services company known for transformative, mission-critical cloud solutions for the retail, finance, and technology sectors built with open source components. Under Victoria’s leadership, the company’s engineers have architected some of the busiest ecommerce services on the internet—none have ever had an outage during the peak season. Grid Dynamics is particularly known for its pioneering work in cloud-based big data and real-time analytics systems and technologies and for its contributions to many open source projects, including Hadoop, Solr, Lucene, and Storm. Previously, Victoria spent 10 years at Sun Microsystems in various technical leadership roles, including lead architect of General Motors, chief architect for financial services, senior scientist at Sun Labs, and principal engineer of Sun Grid, the industry’s first public cloud offering. Victoria holds a BS in computer science from CWRU and attended graduate programs in electrical engineering at Purdue University and computer science at Stanford University. 

Presentations

Open blueprint for real-time analytics in retail (sponsored by Grid Dynamics) Session

Victoria Livschitz outlines key business drivers for real-time analytics applications in retail and describes the emerging architectures based on in-stream processing (ISP) technologies. Victoria shares a complete open blueprint for an ISP platform—including a demo application for real-time Twitter sentiment analytics—designed with 100% open source components and deployable to any cloud.

Julie Lockner is cofounder of 17 Minds Corporation, a startup focusing on improving care and education plans for children with special needs. She has held executive roles at InterSystems, Informatica, and EMC and was an analyst at ESG. She was founder and CEO of CentricInfo, a data management consulting firm. Julie holds an MBA from MIT and a BSEE from WPI.

Presentations

Individualized care driven by wearable data and real-time analytics Session

How can we empower individuals with special needs to reach their potential? Julie Lockner offers an overview of a project to develop collaboration applications that use wearable device data to improve the ability to develop the best possible care and education plans. Join in to learn how real-time IoT data analytics are making this possible.

Andre Luckow is a researcher, technology enthusiast, and project manager at BMW Group, currently living in Greenville, South Carolina. His interests span technology and programming, along with travel and innovation.

Presentations

Deep learning in the automotive industry: Applications and tools Tutorial

Andre Luckow shares best practices for developing and deploying deep learning solutions in the automotive industry and explores different deep learning frameworks, including TensorFlow, Caffe, and Torch, and deep neural network architectures, evaluating their trade-offs in terms of classification performance, training, and inference time.

Maura Lynch is a product manager at Pinterest. Before joining product, she worked in analytics for several years, both at Pinterest and in gaming. Maura started her career in research, in physics at Duke and in economics at the Federal Reserve.

Presentations

New user recommendations at scale: Identifying compelling content for low-signal users using a hybrid-curation approach Tutorial

New users are the most delicate for any service. Nailing their first experience with your product is essential to growing your user base. Maura Lynch offers an overview of Pinterest's hybrid-curation approach to creating compelling content streams for users when there is very little signal as to their preferences.

Roger Magoulas is the research director at O’Reilly Media and chair of the Strata + Hadoop World conferences. Roger and his team build the analysis infrastructure and provide analytic services and insights on technology-adoption trends to business decision makers at O’Reilly and beyond. He and his team find what excites key innovators and use those insights to gather and analyze faint signals from various sources to make sense of what others may adopt and why.

Presentations

Thursday keynote welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Rajiv Maheswaran is CEO of Second Spectrum, an innovative sports analytics and data visualization startup located in Los Angeles, California. His work spans the fields of data analytics, data visualization, real-time interaction, spatiotemporal pattern recognition, artificial intelligence, decision theory, and game theory. Previously, Rajiv served as a research assistant professor within the University of Southern California’s Department of Computer Science and a project leader at the Information Sciences Institute at the USC Viterbi School of Engineering. He and Second Spectrum COO Yu-Han Chang codirected the Computational Behavior Group at USC. Rajiv has received numerous awards and written over 100 publications in artificial intelligence and control theory. Rajiv won the 2014 USC Viterbi School of Engineering Use-Inspired Research Award as well as both the 2014 and 2012 Best Research Paper (Alpha Award) at the renowned MIT Sloan Sports Analytics Conference. He is a frequent speaker at marquee technology conferences and events around the world. Rajiv holds a BS in applied mathematics, engineering, and physics from the University of Wisconsin-Madison and both an MS and PhD in electrical and computer engineering from the University of Illinois at Urbana-Champaign.

Presentations

When machines understand sports Keynote

What happens when machines understand sports? As Rajiv Maheswaran demonstrates, everything changes, from how coaches coach and how players play to how storytellers tell stories and how fans experience the game.

Roland Major is an enterprise architect at Transport for London, where he works on the Surface Intelligent Transport System, which aims to improve the operation of the road network across London and provide greater insight from existing and new data sources using modern data analytic techniques. Previously, Roland worked on event-driven architectures and solutions in the nuclear, petrochemical, and transport industries.

Presentations

Transport for London: Using data to keep London moving Tutorial

Transport for London (TfL) and its partners have been working together on broader integration projects focused on getting the most efficient use out of road networks and public transport. Roland Major explains how TfL brings together a wide range of data from multiple disconnected systems for operational purposes while also making more of them open and available, all in real time.

Ted Malaska is a senior solution architect at Blizzard. Previously, he was a principal solutions architect at Cloudera. Ted has 18 years of professional experience working for startups, the US government, some of the world’s largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the largest nonprofit financial regulator in the US and has worked on close to one hundred clusters for over two dozen clients spanning hundreds of use cases. He has architecture experience across topics including Hadoop, Web 2.0, mobile, SOA (ESB, BPM), and big data. Ted is a regular contributor to the Hadoop, HBase, and Spark projects, a regular committer to Flume, Avro, Pig, and YARN, and the coauthor of Hadoop Application Architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

James Malone is a product manager for Google Cloud Platform and manages Cloud Dataproc and Apache Beam (incubating). Previously, James worked at Disney and Amazon. James is a big fan of open source software because it shows what is possible when people come together to solve common problems with technology. He also loves data, amateur radio, Disneyland, photography, running, and Legos.

Presentations

Architecting and building enterprise-class Spark and Hadoop in cloud environments Tutorial

James Malone explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.

Kevin Mao is a senior data engineer at Capital One Financial Services currently working on the Cybersecurity Data Lake team within Capital One’s Enterprise Data Services organization. Kevin’s current work involves designing and developing tools to ingest and transform cybersecurity-related data streams from across the organization into datasets that are used by security analysts for detecting and forecasting cyberthreats. Kevin holds a BS in computer science from the University of Maryland, Baltimore County and an MS in computer science from George Mason University. In his free time, he enjoys hiking, running, climbing, and snowboarding.

Presentations

Achieving real-time ingestion and analysis of security events through Kafka and Metron Session

Kevin Mao explores the value of and challenges associated with collecting raw security event data from disparate corners of enterprise infrastructure and transforming them into high-quality intelligence that can be used to forecast, detect, and mitigate cybersecurity threats.
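
As a minimal sketch of the ingestion edge of such a pipeline (using the kafka-python client; the topic name and event fields are invented, and this is not Capital One’s actual code):

```python
# Publish a raw security event to Kafka for downstream enrichment (illustrative).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

event = {'src_ip': '10.0.0.5', 'dest_ip': '10.0.0.9',
         'action': 'deny', 'sensor': 'fw-01'}  # invented event schema
producer.send('raw-security-events', event)
producer.flush()  # block until the event is actually on the wire
```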

Bruce Martin is a senior instructor at Cloudera, where he teaches courses on data science, Apache Spark, Apache Hadoop, and data analysis. Previously, Bruce was principal architect and director of advanced concepts at SunGard Higher Education, where he developed the software architecture for SunGard’s Course Signals Early Intervention System, which uses machine-learning algorithms to predict the success of students enrolled in university courses. Bruce’s other roles have included senior staff engineer at Sun Microsystems and researcher at Hewlett-Packard Laboratories. Bruce has written many papers on data management and distributed system technologies and frequently presents his work at academic and industrial conferences. He holds patents on distributed object technologies. Bruce holds a PhD and a master’s degree in computer science from the University of California, San Diego and a bachelor’s degree in computer science from the University of California, Berkeley.

Presentations

Data science at scale: Using Spark and Hadoop 2-Day Training

Bruce Martin walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field. Join in to learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.

Data science at scale: Using Spark and Hadoop (Day 2) Training Day 2

Bruce Martin walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field. Join in to learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.

Manish Marwah is a senior research scientist at Hewlett Packard Labs. His main research interests are in the broad area of data science and its applications to cyber-physical systems, such as smart buildings and data centers. In particular, his research has focused on designing data mining methods for sustainability and energy management. Recently, he has been looking at large-scale analytics and its applications to IoT and security domains. His research has led to over 60 refereed papers, several of which have won awards, including at KDD 2009, IGCC 2011, and AAAI 2013. He has been granted 35 patents. Manish holds a PhD in computer science from the University of Colorado, Boulder and a BTech from the Indian Institute of Technology, Delhi.

Presentations

Malicious site detection with large-scale belief propagation Session

Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP) for Apache Spark, applying BP to large web-crawl data to infer the probability of websites to be malicious. Applications of BP include fraud detection, malware detection, computer vision, and customer retention.
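
For intuition, here is a toy pure-Python version of loopy BP on an invented four-site link graph (a sketch of the algorithm itself, not the speakers’ Spark implementation): one known-bad seed plus a “linked sites tend to share labels” edge potential is enough to propagate suspicion to its neighbors.

```python
# Toy loopy belief propagation over a tiny web graph (illustrative).
import math

edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')]  # invented link graph
nbrs = {}
for u, v in edges:
    nbrs.setdefault(u, []).append(v)
    nbrs.setdefault(v, []).append(u)

phi = {n: [0.5, 0.5] for n in nbrs}  # node priors [P(benign), P(malicious)]
phi['a'] = [0.05, 0.95]              # 'a' is a known-bad seed
psi = [[0.7, 0.3], [0.3, 0.7]]       # linked sites tend to share labels

msgs = {(i, j): [1.0, 1.0] for i in nbrs for j in nbrs[i]}
for _ in range(20):  # iterate sum-product messages to approximate convergence
    new = {}
    for (i, j) in msgs:
        m = []
        for xj in (0, 1):
            total = 0.0
            for xi in (0, 1):
                incoming = 1.0
                for k in nbrs[i]:
                    if k != j:
                        incoming *= msgs[(k, i)][xi]
                total += phi[i][xi] * psi[xi][xj] * incoming
            m.append(total)
        s = sum(m)
        new[(i, j)] = [v / s for v in m]
    msgs = new

for n in sorted(nbrs):  # belief = prior times all incoming messages
    b = [phi[n][x] * math.prod(msgs[(k, n)][x] for k in nbrs[n]) for x in (0, 1)]
    print(n, 'P(malicious) =', round(b[1] / sum(b), 3))
```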

John Matchette is a managing partner at leading management and technology consulting firm Accenture and is a member of the global operating committee that manages Accenture’s Analytics practice. John is also the global general manager of Accenture’s Analytic Platform business, which develops advanced technology products and brings those products to market; his scope of responsibilities includes R&D, operations, partner management, commercials, and sales. John has led several practice areas within Accenture, including the global Supply Chain Planning practice and the global Supply Chain Technologies practice, where he was responsible for sustaining the global talent pool and developing marketplace innovation. John transitioned into the market-facing supply chain role after an assignment in Accenture’s Center for Strategic Technology Research focusing on operations research. Previously, John held several market-facing P&L roles, including leading Accenture’s Public Safety Portfolio and overseeing Accenture’s presence at DHS, DOT, DOJ, DOS, DOE, US courts, and the EPA; leading Accenture’s activities at the Department of Homeland Security (DHS); and working in a variety of logistics areas within the DOD, notably at the Defense Logistics Agency. In the early years of his career, John focused on client delivery for clients including DHS, DOD, Home Depot, Motorola, HP, Oracle, and Nike. John has led research teams on supply chain topics and has written articles on supply chain management. In 2004 and 2005, he served as the lead editor of the journal Achieving Supply Chain Excellence through Technology (ASCET). John holds a degree in industrial management from Purdue University, with concentrations in computer science and industrial engineering. John and his wife Cathy are proud parents to a son, Simon, and a daughter, Isabel.

Presentations

Executive Briefing: Analytics and the new intelligent agenda Session

Join John Matchette and Leonard Hinds as they offer insights into how leading enterprises are unlocking new economic possibilities by embedding intelligence into the core of their business and explore five key actions businesses are taking today to realize the promise of big data and analytics.

Desi Matel-Anderson is the chief wrangler of the Field Innovation Team (FIT) and CEO of the Global Disaster Innovation Group, LLC. FIT has deployed teams to several disasters including the Boston Marathon bombings, assisting at the scene with social media analysis; the Moore, Oklahoma, tornadoes, leading coding solutions; Typhoon Haiyan in the Philippines, through the building of cellular connectivity heat maps; and the Oso, Washington, mudslides, with unmanned aerial system flights, which resulted in a 3D print of the topography for incident command. The team also deploys to humanitarian crises, which have included running a robot petting zoo at the US/Mexico border and leading a women’s empowerment recovery movement after the Nepal earthquakes. Recently, her team deployed to Lebanon for the Syrian refugee crisis, supporting artificial intelligence for access to health care, establishing the power grid, and empowering refugees through evacuation routes utilizing 360-degree virtual reality capture video.

Previously, Desi was the first chief innovation advisor at FEMA, where she led the innovation team to areas affected by Hurricane Sandy to provide real-time problem solving in disaster response and recovery and ran think tanks nationwide to cultivate innovation in communities. Desi’s emergency management experience began when she volunteered in Northern Illinois University’s Office of Emergency Planning. She then worked with the Southeast Wisconsin Urban Area Security Initiative and the City of Milwaukee Office of Emergency Management. In addition to her regional emergency management duties, she worked as a nationwide assessor of the Emergency Management Accreditation Program. Desi lectures on innovation at Harvard, Yale, UC Berkeley, and several other universities across the country and serves as consultant on innovative practices and infrastructure for agencies and governments, nationally and internationally. Desi attended the National Preparedness Leadership Institute at Harvard’s Kennedy School of Government and School of Public Health and served on the advisory board of Harvard’s National Preparedness Leadership Institute in 2013. She holds a JD from Northern Illinois University.

Presentations

Data in disasters: Saving lives and innovating in real time Keynote

Data to the rescue. Desi Matel-Anderson offers an immersive deep dive into the world of the Field Innovation Team, who routinely find themselves on the frontier of disasters working closely with data to save lives, at times while risking their own.

Murthy Mathiprakasam is a director of product marketing for Informatica’s big data products. Murthy has a decade and a half of experience working with emerging high-growth software technologies, including roles at Mercury Interactive/HP, Google, eBay, VMware, and Oracle. Murthy holds an MS in management science from Stanford University and BS degrees in management science and computer science from the Massachusetts Institute of Technology.

Presentations

Get data lakes, data catalogs, and real-time streams in less time with fewer people and more machine learning (sponsored by Informatica) Session

Stuck with manual, siloed, inflexible, laborious practices for big data projects? Successful teams use machine-learning-based approaches to power self-service preparation, enterprise-wide data catalogs, and real-time stream processing with role-specific tools. Murthy Mathiprakasam explains how using Informatica atop Hadoop, Spark, and Spark Streaming maximizes teamwork, trust, and timeliness.

Carol Mcdonald is a solutions architect at MapR focusing on big data, Apache HBase, Apache Drill, Apache Spark, and machine learning in healthcare, finance, and telecom. Previously, Carol worked as a technology evangelist for Sun and as an architect/developer on a large health information exchange, a large loan application for a leading bank, pharmaceutical applications for Roche, telecom applications for HP, OSI messaging applications for IBM, and SIGINT applications for the NSA. Carol holds an MS in computer science from the University of Tennessee and a BS in geology from Vanderbilt University and is an O’Reilly Certified Spark Developer and Sun Certified Java Architect and Java Programmer. Carol is fluent in French and German.

Presentations

Applying machine learning to live patient data Session

Joseph Blue and Carol Mcdonald walk you through a reference application that processes HL7-encoded ECG data with a modern anomaly detector, demonstrating how combining visualization and alerting enables healthcare professionals to improve outcomes and reduce costs and sharing lessons learned from their experience dealing with real data in real medical situations.

Martin Mendez-Costabel leads the Geospatial Data Asset team for Monsanto’s Products and Engineering organization within the IT department, where he drives the engineering and adoption of global geospatial data assets for the enterprise. He has more than 12 years of experience in the agricultural sector covering a wide range of precision agriculture-related roles, including data scientist and GIS manager for E&J Gallo Winery in California. Martin holds an agronomy degree (BSc) from the National University of Uruguay and two viticulture degrees: an MSc from the University of California, Davis and a PhD from the University of Adelaide in Australia.

Presentations

The enterprise geospatial platform: A perfect fusion of cloud and open source technologies Session

Recently, the volume of data collected from farmers' fields via sensors, rovers, drones, in-cabin technologies, and other sources has forced Monsanto to rethink its geospatial processing capabilities. Naghman Waheed and Martin Mendez-Costabel explain how Monsanto built a scalable geospatial platform using cloud and open source technologies.

Stephen Merity is a senior research scientist at MetaMind, part of Salesforce Research, where he works on researching and implementing deep learning models for vision and text, with a focus on memory networks and neural attention mechanisms for computer vision and natural language processing tasks. Previously, Stephen worked on big data at Common Crawl, data analytics at Freelancer.com, and online education at Grok Learning. Stephen holds a master’s degree in computational science and engineering from Harvard University and a bachelor of information technology from the University of Sydney.

Presentations

The frontiers of attention and memory in neural networks Session

While attention and memory have become important components in many state-of-the-art deep learning architectures, it's not always obvious where they may be most useful. Even more challenging, such models can be very computationally intensive for production. Stephen Merity discusses the most recent techniques, what tasks they show the most promise in, and when they make sense in production systems.

Greg Michaelson leads the client-facing data science practice at DataRobot, where he and his team work with clients across the world to ensure their success using the DataRobot platform to solve their business problems. Previously, Greg led modeling teams at Travelers and Regions Financial. He holds a PhD in applied statistics from the Culverhouse College of Business Administration at the University of Alabama. Greg lives in Charlotte, NC, with his wife, four children, and pet tarantula.

Presentations

Exploiting Hadoop with artificial intelligence and machine learning (sponsored by DataRobot) Session

Companies store tons of data in Hadoop in hopes of turning the data into actionable insights, but maximizing the value of this resource with artificial intelligence and machine learning eludes most organizations. Greg Michaelson defines analytic trends around Hadoop, separates fact from hype, and sets out a roadmap for fully optimizing the value of the data stored in Hadoop.

John Mikula is a tech lead for Google Cloud, where he manages the team focused on enterprise features for Google Cloud Dataproc.

Presentations

Architecting and building enterprise-class Spark and Hadoop in cloud environments Tutorial

James Malone explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.

Mostafa Mokhtar is a performance engineer at Cloudera. Previously, he held similar roles at Hortonworks and on the SQL Server team at Microsoft.

Presentations

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Rajat Monga leads TensorFlow, an open source machine-learning library and the center of Google’s efforts at scaling up deep learning. He is one of the founding members of the Google Brain team and is interested in pushing machine-learning research forward toward general AI. Previously, Rajat was the chief architect and director of engineering at Attributor, where he led the labs and operations and built out the engineering team. A veteran developer, Rajat has worked at eBay, Infosys, and a number of startups.

Presentations

The state of TensorFlow today and where it is headed in 2017 Session

Rajat Monga offers an overview of TensorFlow progress and adoption in 2016 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas.

Gleicon Moraes is director of data engineering at luc.id. Gleicon loves infrastructure for data, moving large volumes through distributed messaging systems, and databases. He uses Python, Go, and Erlang and focuses on distributed systems, nonrelational databases, and OSS.

Presentations

Building a recommender from a big behavior graph over Cassandra Session

Gleicon Moraes and Arthur Grava share war stories about developing and deploying a cloud-based large-scale recommender system for a top-three Brazilian ecommerce company. The system, which uses Cassandra and graph traversal, led to a more than 15% increase in sales.

Todd Mostak is the founder and CEO of MapD, a pioneer in building GPU-tuned analytics and visualization applications for the enterprise. Previously, Todd was a research fellow at MIT’s Computer Science and Artificial Intelligence Laboratory, where he focused on GPU databases and visualization. Todd conceived of the idea of using GPUs to accelerate the extraction of insights from large datasets while conducting graduate research on the role of Twitter in the Arab Spring. Frustrated by the inability of conventional technologies to allow for the interactive exploration of these multimillion-row datasets, Todd built one of the first GPU-based databases. Todd holds an MA in Middle Eastern studies from Harvard.

Presentations

From hours to milliseconds: How Verizon accelerated its mobile analytics Session

With more than 91M customers, Verizon produces oceans of data. The challenge this onslaught presents isn’t one of storage—it’s one of speed. The solution? Harnessing the power of GPUs to access insights in less than a millisecond. Todd Mostak and Abdul Subhan explain how Verizon solved its data challenge by implementing GPU-tuned analytics and visualization.

John Mount is a principal consultant at Win-Vector LLC, a San Francisco data science consultancy. John has worked as a computational scientist in biotechnology and a stock-trading algorithm designer and has managed a research team for Shopping.com (now an eBay company). He is the coauthor of Practical Data Science with R (Manning Publications, 2014). John started his advanced education in mathematics at UC Berkeley and holds a PhD in computer science from Carnegie Mellon (specializing in the design and analysis of randomized algorithms). He currently blogs about technical issues at the Win-Vector blog, tweets at @WinVectorLLC, and is active in the Rotary. Please contact jmount@win-vector.com for projects and collaborations.

Presentations

Modeling big data with R, sparklyr, and Apache Spark Tutorial

Sparklyr provides an R interface to Spark. With sparklyr, you can manipulate Spark datasets to bring them into R for analysis and visualization and use sparklyr to orchestrate distributed machine learning in Spark from R with the Spark MLlib and H2O Sparkling Water libraries. John Mount demonstrates how to use sparklyr to analyze big data in Spark.

Office Hour with John Mount (Win-Vector LLC) Office Hour

If you're interested in using sparklyr, an R interface to Spark's distributed machine-learning algorithms, John's office hour could prove immensely valuable.

Chris Murphy recently joined the architecture team at one of the UK’s leading banks. A highly skilled and experienced IT solutions architect, Chris has held various IT positions at several global financial services companies, including Fidelity Investments, Zurich Insurance Group, and Liberty Mutual. Over the past 16 years, he has led and managed challenging technology projects, including, most recently, the design and delivery of a Hadoop operational data lake and an IT robotics platform. Chris enjoys speaking at major industry events about data management and analytics. He holds a BS in business information systems from University College Cork, Ireland.

Presentations

Building the “future you” retirement planning service on a Hadoop data lake Tutorial

Chris Murphy explains how a major insurance company adopted Hadoop to leverage its data to underpin the new customer-centric ethos of the organization and how this has enabled a new approach to helping customers truly understand their financial portfolios, build a roadmap to meet their financial goals, identify opportunities, and help them secure their financial future.

Justin Murray is a technical product marketing manager in big data at VMware, where he works with VMware’s customers and field engineering to create guidelines and best practices for using virtualization technology for big data. He has spoken at a variety of conferences on these subjects and has published blogs, white papers, and other materials in this field.

Presentations

Virtualizing Hadoop and Spark: Architecture, performance, and best practices (sponsored by VMware) Session

Justin Murray outlines the benefits of virtualizing Hadoop and Spark, covering the main architectural approaches at a technical level and demonstrating how the core Hadoop architecture maps into virtual machines and how those relate to physical servers. You'll gain a set of design approaches and best practices to make your application infrastructure fit well with the virtualization layer.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).

Presentations

The future of column-oriented data processing with Arrow and Parquet Session

In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, such as RDMA, SSDs, and nonvolatile memory.

Vijay Narayanan heads the algorithms and data science efforts in the Data group at Microsoft, where he works on building and leveraging machine-learning platforms, tools, and solutions to solve analytic problems in diverse domains. Previously, Vijay was a principal scientist at Yahoo Labs, where he worked on building cloud-based machine-learning applications in computational advertising; an analytic science manager at FICO, where he worked on launching a product to combat identity theft and application fraud using machine learning; a modeling researcher at ACI Worldwide; and a Sloan Digital Sky Survey research fellow in astrophysics at Princeton University, where he codiscovered the ionization boundary and the four farthest quasars in the universe. Vijay has authored or coauthored approximately 55 peer-reviewed papers in astrophysics and 10 papers on machine-learning and data mining techniques and applications and holds 15 patents (filed or granted). He is deeply interested in the theoretical, applied, and business aspects of large-scale data mining and machine learning and has indiscriminate interests in statistics, information retrieval, extraction, signal processing, information theory, and large-scale computing. Vijay holds a bachelor of technology degree from IIT, Chennai and a PhD in astronomy from the Ohio State University.

Presentations

Big data, AI, the genome, and everything (sponsored by Microsoft) Keynote

Vijay Narayanan takes you on an inspiring journey exploring how the cloud, data, and artificial intelligence are powering and accelerating the genomic revolution—saving and changing lives in the process.

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Building your first big data application on AWS Tutorial

Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ben Snively, Radhika Ravirala, Ryan Nienhuis, and Dario Rivera walk you through building a big data application using open source technologies, such as Apache Hadoop, Spark, and Zeppelin, and AWS managed services, such as Amazon EMR, Amazon Kinesis, and more.

Dinesh Nirmal is vice president of analytics development at IBM. Dinesh has held a number of roles at IBM, including member of the SAP porting team for z/OS, JDBC/SQLJ application developer, senior manager for DB2 Optim tools development, IMS director, vice president of Smarter Process, vice president of analytics for z, and Silicon Valley Lab site executive. Dinesh holds an MS in computer science, an MBA in finance, and a BS in chemistry from SUNY.

Presentations

Machine learning is about your data and deployment, not just model development (sponsored by IBM) Keynote

Which is more important: the model or the data? Dinesh Nirmal explains how your data can help you build the right cognitive systems to learn about, reason with, and engage with your customers.

Jack Norris is the senior vice president of data and applications at MapR Technologies. Over his 20 years in enterprise software marketing, Jack has demonstrated a wide range of successes, from defining new markets for small companies to increasing sales of new products for large public companies. His broad experience includes launching and establishing analytics, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain & Company. Jack has an MBA from UCLA’s Anderson School of Management and a BA in economics with honors and distinction from Stanford University.

Presentations

The main event: Identifying and exploiting the keys to digital transformation Session

Leading companies are integrating operations and analytics to make real-time adjustments to improve revenues, reduce costs, and mitigate risks. There are many aspects to digital transformation, but the timely delivery of actionable data is both a key enabler and an obstacle. Jack Norris explores how companies from TransUnion to Uber use event-driven processing to transform their businesses.

Lana Novikova is the founder and CEO of text analytics startup Heartbeat Ai Technologies, which is bringing a unique perspective to collecting and analyzing affect data. Lana is a market research innovator and tech entrepreneur with a solid research management career and an award-winning portfolio of research inventions. A marketer by training and a market researcher by trade, she was never satisfied with mere numbers and shallow observations, always pushing to understand the “deep why” behind people’s decisions by connecting the dots between cognitive sciences, consumer research, and marketing. This drive is evident in Heartbeat Ai’s products, which offer insight tools to business leaders who, like Lana, are not satisfied with the “status quo.”

Presentations

Emotion text analytics for deeper understanding and better prediction of irrational human behavior Tutorial

What if we could rely on text data to be the “secret sauce” for accurate prediction of future events that are based on human decisions, such as elections, consumer behavior, public opinion, and social movements? Lana Novikova shares the results of Heartbeat Ai's experiment to see if it could build a predictive algorithm for national elections using unstructured text data from surveys.

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and application solutions. A veteran of @WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Julie Steele, Stephen O'Sullivan, and Scott Kurth field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for and the evolving role of the CDO. Even if you don’t have a specific question, join in to hear what others are asking.

Office Hour with John Akred and Stephen O'Sullivan (Silicon Valley Data Science) Office Hour

Want advice on building a data platform, tool selection, or integration with legacy systems? Talk to John and Stephen.

Bill ODonnell is a strategic security architect in security product management at MapR. With over 20 years’ experience in information technology, Bill understands the complexities involved in designing enterprise solutions from every perspective. For the last decade, he has spearheaded secure, flexible, and cutting-edge solutions, most recently as a senior security technology leader at IBM, where he was the chief security architect for IBM WebSphere, authored the company’s Security Intelligence blog, and received over 30 notable awards. Bill has registered 13 patents and is known as a security subject-matter expert at conferences and in publications.

Presentations

Pluggable security in Hadoop Session

Security will always be very important in the world of big data, but the choices today mostly start with Kerberos. Does that mean setting up security is always going to be painful? What if your company standardizes on other security alternatives? What if you want to have the freedom to decide what security type to support? Yuliya Feldman and Bill ODonnell discuss your options.

Max Ogden is the director of Code for Science, a nonprofit behind the Dat Project, an open source distributed filesystem for the web that synchronizes large datasets over a peer-to-peer network. Max is a computer programmer working in civic media, open data, and open source as well as a former Code for America fellow, Node.js and JavaScript community organizer, and the author of hundreds of small open source modules. Max is passionate about teaching and enabling the sharing of information.

Presentations

Data at risk: Backing up the world's research data Session

Max Ogden offers an overview of Data Refuge, a nationwide volunteer effort led by librarians, scientists, and coders to discover and back up research data at risk of disappearing. Max discusses his work to uncover hundreds of federal data servers containing petabytes of publicly funded research data and his plan to keep it online and useful to researchers in the future.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

The machine-learning renaissance Keynote

Data is powering a machine-learning renaissance. Understanding our data helps save lives, secure our personal and business information, and engage our customers with better relevance. However, as Mike Olson explains, without big data and a platform to manage big data, machine learning and artificial intelligence just don’t work.

Jerry Overton is a data scientist and distinguished technologist in DXC’s Analytics group, where he is the principal data scientist for industrial machine learning, a strategic alliance between DXC and Microsoft comprising enterprise-scale applications across six different industries: banking and capital markets, energy and technology, insurance, manufacturing, healthcare, and retail. Jerry is the author of Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist (O’Reilly) and teaches the Safari training course Mastering Data Science at Enterprise Scale. In his blog, Doing Data Science, Jerry shares his experiences leading open research and transforming organizations using data science.

Presentations

Executive Briefing: An executive’s guide to understanding advanced analytics in the cloud Session

Jerry Overton provides an executive's guide to understanding advanced analytics in the cloud—offering a comprehensive survey of cloud technologies, patterns of cloud-based architectures, and patterns of enterprise cloud adoption, describing paths to achieving a cognitive enterprise, and outlining the realistic next steps for executives.

Avinash Padmanabhan is a staff quality engineer in Intuit’s Small Business Data and Analytics group, where he focuses on ensuring quality of the data pipeline that enables the work of analysts and business stakeholders. Avinash has over 12 years of experience specializing in building frameworks and solutions that solve challenging quality problems and delight customers. He holds a master’s degree in electrical and computer engineering from the State University of New York.

Presentations

Shifting left for continuous quality in an Agile data world Session

Data warehouses are critical in driving business decisions, and SQL is the dominant language for building ETL pipelines. While the technology has shifted from RDBMS-centric data warehouses to data pipelines based on Hadoop and MPP databases, engineering and quality processes have not kept pace. Avinash Padmanabhan highlights the changes that Intuit's team made to improve processes and data quality.

Shoumik Palkar is a second-year PhD student in the Infolab at Stanford University, working with Matei Zaharia on high-performance data analytics. He holds a degree in electrical engineering and computer science from UC Berkeley.

Presentations

Weld: An optimizing runtime for high-performance data analytics Session

Modern data applications combine functions from many libraries and frameworks and cannot achieve peak hardware performance due to data movement across functions. Shoumik Palkar offers an overview of Weld, an optimizing runtime that enables optimizations across disjoint libraries, and explains how to integrate it into frameworks such as Spark SQL for performance gains with no changes to user code.

Lloyd Palum is the CTO of Vnomics, where he directs the company’s technology development associated with optimizing fuel economy in commercial trucking. Lloyd has more than 25 years of experience in both commercial and government electronics. Previously, he led the development of a new product line of surveillance equipment at Harris Corp. and helped facilitate the growth of the company in new markets; he was also director of DSP application development at a startup focused on configurable digital signal processing systems. A leader in the design of communication and networking systems, Lloyd has published a number of technical articles and speaks frequently at industry conferences. He holds five patents in the field of software and wireless communications. Lloyd earned an MS in electrical engineering (MSEE) from Boston University and a BS in electrical engineering (BSEE) from the University of Rochester.

Presentations

How Vnomics built and deployed a “digital twin” in commercial trucking that led to $160M (and counting) in verified operational fuel savings Tutorial

Lloyd Palum explores the importance of identifying the target business value in an IIoT application—a prerequisite to justifying a return on technology investment—and explains how to deliver that value using the concept of a “digital twin.”

Gene Pang is a software engineer at Alluxio. Previously, he worked at Google. Gene recently earned his PhD from the AMPLab at UC Berkeley, working on distributed database systems, and holds an MS from Stanford University and a BS from Cornell University.

Presentations

Alluxio (formerly Tachyon): The journey thus far and the road ahead Session

Alluxio (formerly Tachyon) is an open source memory-speed virtual distributed storage system. The project has seen tremendous improvements in performance and scalability and has been extended with key new features. Haoyuan Li and Gene Pang explore Alluxio's goal of making the product accessible to an even wider set of users through a focus on security, new language bindings, and APIs.

Kishore Papineni is director of information strategy and management and RWI and analytics at Astellas Pharma.

Presentations

Astellas Pharma's marketing analytics data lake Tutorial

Launched in late 2015, Astellas's enterprise data lake project is taking the company on a data governance journey. Kishore Papineni offers an overview of the project, providing insights into some of the business pain points and key drivers, how it has led to organizational change, and the best practices associated with Astellas's new data governance process.

Kartik Paramasivam is a senior software engineering leader at LinkedIn. Kartik specializes in cloud computing, distributed systems, enterprise and cloud messaging, stream processing, the internet of things, web services, middleware platforms, application hosting, and enterprise application integration (EAI). He has authored a number of patents. Kartik holds a bachelor of engineering from the Maharaja Sayajirao University of Baroda and an MS in computer science from Clemson University.

Presentations

Processing millions of events per second without breaking the bank Session

LinkedIn has one of the largest Kafka installations in the world, ingesting more than a trillion messages per day. Apache Samza-based stream processing applications process this deluge of data. Kartik Paramasivam discusses key improvements and architectural patterns that LinkedIn has adopted in its data systems in order to process millions of requests per second while keeping costs in control.

Jacob Parr is the owner of JParr Productions, where he writes courseware, leads one-on-one training for companies like Databricks, Nike, Comcast, Cisco, AOL, and Moody’s Analytics, and speaks at conferences like Spark Summit. Jacob became interested in software development at the age of 11, and just two years later, he began programming his own video games—he’s been developing software ever since. Over his 20-year career, he has worked in software testing and test automation for Sierra On-Line (aka The ImagiNation Network, aka AOL Entertainment); developed software for Sierra Telephone, first as an engineer and eventually as an architect and senior developer; and built custom software for websites, ecommerce systems, real-estate applications, and even the occasional enterprise tax consultant. His background includes telecommunications, billing systems, service order systems, trouble ticketing systems, and enterprise integration, and he has built everything from Swing apps to monoliths to REST and microservices architectures. He participates in a number of open source projects. Jacob lives in Oakhurst, CA, with his lovely wife. As empty nesters of three adult children, they enjoy spoiling their Boston terriers. He loves to play practical jokes, fly drones, chase his nephews and nieces with an arsenal of Nerf guns, and work on his N-scale train set. In his little spare time, he loves to (you guessed it) work on his pet software projects.

Presentations

Spark foundations: Prototyping Spark use cases on Wikipedia datasets 2-Day Training

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Jacob Parr employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.

Spark foundations: Prototyping Spark use cases on Wikipedia datasets (Day 2) Training Day 2

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Jacob Parr employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.

Nixon Patel is founder, CEO, and MD of analytics startup Kovid. Nixon is a visionary leader, an exemplary technocrat, and a successful, business-oriented entrepreneur with a proven track record of growing six businesses from startups to large, global technology companies with millions in annual sales—in industries ranging from big data technology, analytics, the cloud, the IoT, speech recognition, and machine learning to renewable energy, information technology, telecommunications, and pharmaceuticals—all over a short 26-year span. Previously, he was chief data scientist at Brillio, a sister company of Collabera, where he was instrumental in starting the big data analytics, cloud, and IoT practices and establishing centers of excellence and co-innovation labs. He is also an independent director of VivMed Labs and Tripborn. Nixon holds a BT with honors in chemical engineering from IIT Kharagpur, an MS in computer science from the New Jersey Institute of Technology, and a data science specialization from Johns Hopkins University. He is currently pursuing a second master’s degree in business and science in analytics from Rutgers University.

Presentations

Real-time analysis of behavior of law enforcement encounters using big data analytics and deep learning multimodal emotion-recognition models Tutorial

Being able to monitor the emotional state of police officers over a period of time would enable the officers’ supervisors to intervene if a given officer is subject to repeated emotional stress. Nixon Patel presents deep learning and AI models that capture and analyze the emotional state of law enforcement officers over a period of time.

Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh holds a master of computer science from the University of Tennessee at Chattanooga, where he did research in mesh networks and social insect swarm algorithms. Josh is a cofounder of the DL4J open source deep learning project and a coauthor of the upcoming O’Reilly title Deep Learning: A Practitioner’s Approach. Josh has over 15 years’ experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif.

Presentations

Scalable deep learning for the enterprise with DL4J Tutorial

Dave Kale, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets.

Rajiv Ayyangar is the head of product at Yakit Logistics, a B2C cross-border shipping API. Previously, he was a data scientist at Aviate (acquired by Yahoo) and a product manager at Delectable.

Presentations

Greasing the wheels of international logistics Tutorial

Content will be shared on stage.

Vanja Paunić is a data scientist on the Azure Machine Learning team at Microsoft. Previously, Vanja worked as a research scientist in the field of bioinformatics, where she published on uncertainty in genetic data, genetic admixture, and prediction of genes. She holds a PhD in computer science with a focus on data mining from the University of Minnesota.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Frances Perry is a software engineer who likes to make big data processing easy, intuitive, and efficient. After many years working on Google’s internal data processing stack, Frances joined the Cloud Dataflow team to make this technology available to external cloud customers. She led the early work on Dataflow’s unified batch/streaming programming model and is on the PMC for Apache Beam.

Presentations

Ask me anything: Apache Beam AMA

Join Tyler Akidau, Frances Perry, Kenneth Knowles, and Slava Chernyak to discuss anything related to Apache Beam.

Learn stream processing with Apache Beam Tutorial

Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau and Frances Perry cover the basics of robust stream processing with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow.

Chris Pouliot is a real-life rocket scientist who has also spun astronauts until they were motion sick, split atoms to make an aircraft carrier go fast, provided insightful analysis that led Google to change its top ad color, and helped Netflix determine what movies and TV shows to buy and how much to pay for them. He is currently the VP of data science at Lyft, where he leads a team of data scientists working on algorithms to improve the Lyft user experience (e.g., dispatch, ETA, and pricing).

Presentations

Unboxing logistics innovation Tutorial

Michael Abbott shares trends Kleiner Perkins Caufield & Byers is seeing in the area of transportation and logistics from an investments perspective and offers direct insights from companies in the sector, looking at how these firms deal with unique data processing challenges.

Ganesh Prabhu is a staff software engineer at FireEye with 20+ years of RDBMS and engineering experience.

Presentations

FireEye's journey migrating 25 TB of RDBMS data to Hadoop Session

Ganesh Prabhu, Alex Rivlin, and Vivek Agate share an approach that enabled a small team at FireEye to migrate 20 TB of RDBMS data comprising 250+ tables and nearly 2,000 partitions to Hadoop and an adaptive platform that allows migration of a rapidly changing dataset to Hive. Along the way, they explore some of the challenges typical for a company implementing Hadoop.

Ryan Pridgeon is a customer operations engineer at Confluent. Ryan has a deep-rooted passion for tinkering that knows no bounds. Be it automotive, software, or carpentry, if it has pieces, Ryan wants to take it apart. He’s still working on putting things back together, though.

Presentations

Mistakes were made, but not by us: Lessons from a year of supporting Apache Kafka Session

Dustin Cote and Ryan Pridgeon share their experience troubleshooting Apache Kafka in production environments and discuss how to avoid pitfalls like message loss or performance degradation in your environment.

Manny Puentes is an experienced executive leader in the digital advertising and software industries. He is the founder and CEO of Boulder-based startup Rebel AI, which addresses some of the digital advertising industry’s most pressing problems through strategic consulting and a suite of products built to ensure ad security and quality in programmatic media trading. With more than 20 years of experience in digital advertising, Manny has led engineering and product teams to build a number of enterprise-scale platforms for digital media trading by leveraging specialties in real-time bidding, data pipeline architecture, natural language processing, and machine learning.

Presentations

Streams: Successfully transforming your business one millisecond at a time Session

In 2016, digital advertising overtook TV in spend, requiring companies to cut through the noise to reach their audience. Manny Puentes explains how Rebel AI decides which ads to serve across devices and how it delivers multidimensional reporting in milliseconds.

Kishore Reddipalli is a senior technical architect at GE Digital leading the product architecture for Predix product operations optimization. Kishore has been with GE for nearly eight years, working with various industrial business domains including oil and gas, transportation, power, and renewables. He is an expert in building big data applications. Prior to joining GE Digital, Kishore worked in GE Healthcare, where he helped build the next-generation EMR platform Qualibria.

Presentations

Optimizing industrial operations in real time using the big data ecosystem Session

Kishore Reddipalli explores how to stream data at a large scale from the edge to the cloud to the client, detect anomalies, analyze machine data in stream and at rest in an industrial world, and optimize industrial operations by providing real-time insights and recommendations using big data technologies.

Siva Raghupathy leads the Americas Big Data Solutions Architecture team at AWS, where he guides developers and architects in building successful big data solutions on AWS. Previously, as a principal technical program manager for AWS Database Services, Siva gathered emerging NoSQL requirements and wrote the first version of the DynamoDB product specification. Later, as a development manager for Amazon Relational Database Service (RDS), he drove several enhancements. Prior to AWS, Siva spent several years at Microsoft.

Presentations

Serverless big data architectures: Design patterns and best practices (sponsored by AWS) Session

Siva Raghupathy and Ben Snively explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, they explain when and how you can use serverless technologies to streamline data processing and share a reference architecture using a combination of cloud and open source technologies.

Prasanna Rajaperumal is a senior engineer at Uber, working on the next generation of Uber’s data infrastructure and building data systems that scale along with Uber’s hypergrowth. Over the last six months, he has been focused on building a library that ingests change logs into large HDFS datasets, optimized for analytical workloads. Prasanna has held various roles building data systems at companies small and large. Previously, he was a software engineer at Cloudera, working on building out data infrastructure for indexing and visualizing customer log files.

Presentations

Hoodie: Incremental processing on Hadoop at Uber Session

Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Vinoth Chandar and Prasanna Rajaperumal introduce Hoodie, a newly open sourced system at Uber that adds new incremental processing primitives to existing Hadoop technologies to provide near-real-time data at 10x reduced cost.

Karthik Ramasamy is the engineering manager and technical lead for real-time analytics at Twitter. Karthik is the cocreator of Heron and has more than two decades of experience working in parallel databases, big data infrastructure, and networking. He cofounded Locomatix, a company that specializes in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Before Locomatix, he had a brief stint with Greenplum, where he worked on parallel query scheduling. Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases, and high-availability solutions for network routers that are widely deployed on the internet. He holds several patents and is the author of numerous publications and the best-selling book Network Routing: Algorithms, Protocols, and Architectures. Karthik has a PhD in computer science from UW Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Anomaly detection in real-time data streams using Heron Session

Anomaly detection plays a key role in the analysis of real-time streams; this is exemplified by, say, detecting real-life incidents from tweet storms. Arun Kejariwal and Karthik Ramasamy walk you through how anomaly detection is supported in real-time data streams in Heron—the streaming system built in-house at Twitter (and open sourced) for real-time computation.

Satya Ramaswamy is vice president and the global head of the Digital Enterprise unit within Tata Consultancy Services, where he leads the worldwide organization that helps customers reimagine business models, products and services, customer segments, channels, business processes, and workplaces by leveraging the “digital five forces”: mobility, big data, social media, the cloud, and artificial intelligence and robotics. He is particularly interested in the intersection of the digital and physical, where digital tools can be deployed to improve customer experience and organizational performance in physical settings, particularly using IoT technologies. Satya has more than 23 years of experience in digital technologies spanning engineering, product management, strategy consulting, and global organizational leadership. He has contributed to the mobile industry since the earliest days of digital phones. Satya has been part of two successful startup companies in the mobile application and big data spaces. His counsel is sought after by clients in multiple industries across North America, Europe, Latin America, Japan, and the wider Asia-Pacific. He has been quoted by media across the world on the evolution and impact of digital technologies on the modern-day enterprise and has been featured in the Wall Street Journal, Harvard Business Review, Financial Times, Fortune, Singapore Business Times, CIO.com, the Telegraph, Forbes, and several industry publications. Satya holds 10 US patents. He has a PhD in distributed computing from the Indian Institute of Technology, Chennai and an MBA in marketing and analytical consulting from the Kellogg School of Management.

Presentations

Executive Briefing: Artificial intelligence Session

Satya Ramaswamy and Sunil Karkera offer an overview of the recent technical advances that have made the current AI revolution possible, convincingly answering the "why now?" question.

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed, robust cloud applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. Radhika enjoys spending time with her family, walking her dog, doing Warrior X-Fit, and playing an occasional hand at Smash Bros.

Presentations

Building your first big data application on AWS Tutorial

Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ben Snively, Radhika Ravirala, Ryan Nienhuis, and Dario Rivera walk you through building a big data application using open source technologies, such as Apache Hadoop, Spark, and Zeppelin, and AWS managed services, such as Amazon EMR, Amazon Kinesis, and more.

Roger Rea is the product manager for InfoSphere Streams at IBM, where he is responsible for driving market share growth and leading development, marketing, sales, finance, service, and support. Roger has held a variety of sales, technical, educational, marketing, and management jobs at IBM, Skill Dynamics, and Tivoli Systems. He has received four 100% Clubs for exceeding sales targets in large account and general business territories, numerous excellence awards for development and delivery of competitive, strategic, and Unix marketing courses, and several teamwork awards for successes like revenue growth of 67% in one year as brand manager for Intel-based systems management. At the Systems Engineering Symposium, he spoke about the on-time installation of a directory assistance system with multiple sites in two states. Roger holds a BS in mathematics and computer science (cum laude) from UCLA and a master’s certificate in project management from George Washington University. He lives with his wife and two children in Cary, North Carolina, where he enjoys kayaking, cooking, reading, and singing in the choir.

Presentations

Top enterprise use cases for streaming and machine learning (sponsored by IBM) Session

Roger Rea and Jorge Castanon outline the top enterprise use cases for streaming and machine learning.

Warren Reed is a data scientist at the US Treasury’s Office of Financial Research (OFR), where he leads the OFR’s Monitors program, which is developing interactive tools and a data platform to assess, measure, and monitor risk across the financial system. Before joining the OFR, Warren was a data scientist at startup Gro Intelligence and a quantitative trader at Barclays Capital. He holds an MS in applied informatics from New York University and a BS in chemical engineering from Columbia University.

Presentations

Building interactive data products for risk measurement and monitoring Session

Warren Reed explains how he and his team at the US Treasury’s Office of Financial Research leverage data visualization techniques to build interactive data products for risk measurement and monitoring.

Tom Reilly is the CEO of Cloudera. Tom has had a distinguished 30-year career in the enterprise software market. Previously, Tom was vice president and general manager of enterprise security at HP; CEO of enterprise security company ArcSight, where he led the company through a successful initial public offering and subsequent sale to HP; and vice president of business information services for IBM, following the acquisition of Trigo Technologies Inc., a master data management (MDM) software company, where he served as CEO. He currently serves on the boards of Jive Software, privately held Ombud Inc., ThreatStream Inc., and Cloudera. Tom holds a BS in mechanical engineering from the University of California, Berkeley.

Presentations

Becoming smarter about credible news Keynote

Data helps us understand our market in new ways. In today's world, sifting through the noise in modern journalism means navigating enormous amounts of data, news, and tweets. Tom Reilly and Khalid Al-Kofahi explain how Thomson Reuters is leveraging big data and machine learning to chase down leads, verify sources, and determine what's newsworthy.

Fred Reiss is chief architect and one of the founding employees of the IBM Spark Technology Center in San Francisco. Previously, Fred spent nine years at IBM Research Almaden, where he worked on the SystemML and SystemT projects as well as on the research prototype of DB2 with BLU Acceleration. He has over 25 peer-reviewed publications and six patents. Fred holds a PhD from UC Berkeley.

Presentations

Compressed linear algebra in Apache SystemML Session

Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.

Leveraging deep learning to predict breast cancer proliferation scores with Apache Spark and Apache SystemML Session

Estimating the growth rate of tumors is a very important but very expensive and time-consuming part of diagnosing and treating breast cancer. Michael Dusenberry and Frederick Reiss describe how to use deep learning with Apache Spark and Apache SystemML to automate this critical image classification task.

Andreas Ribbrock is principal data scientist at #zeroG, a Lufthansa Systems company, where he is building a big data architecture and data science team. Previously, Andreas was the head of product management and data scientist for Cupertino- and Cologne-based IoT database startup ParStream and the team leader for the data science practice of analytics heavyweight Teradata, where he delivered leading-edge data science solutions in various industries for global players like DHL Express, Deutsche Post, Lufthansa, Otto Group, Siemens, Deutsche Telekom, Vodafone Australia, and Volkswagen. He has presented at international conferences on topics related to big data architectures, data science, and data warehousing. Andreas holds a PhD in computer science from Bonn University. His fields of study included signal processing, data structures, and algorithms for content-based retrieval in nonrelationally structured data like images, audio signals, and 3D protein molecules.

Presentations

How Lufthansa German Airlines is using data analytics to create the next level of customer experience Tutorial

The aviation industry is facing huge cost pressure as well as profound disruption in marketing and service. With ticket revenues dropping, increasing customer loyalty is key. Andreas Ribbrock explains how Lufthansa German Airlines uses data science and data-driven decision making to create the next level of digital customer experience along the full customer journey.

Eric Richardson is a senior engineer at the American Chemical Society. He enjoys late-night coding sessions and long walks through architectures. Eric believes in architectures built on sound principles. He likes a tall, frosty glass of collaboration and information sharing and knows that people make better decisions when they know the landscape. When the time is right, you must free your architecture and let it find its own path—it will exceed your wildest hopes…and fears.

Presentations

Architecting an enterprise data hub in a 110-year-old company Session

Eric Richardson explains how ACS used Hadoop, HBase, Spark, Kafka, and Solr to create a hybrid cloud enterprise data hub that scales without drama and drives adoption through ease of use, covering the architecture, the technologies used, the challenges faced and defeated, and the problems yet to be solved.

Dario Rivera is a solutions architect at Amazon Web Services, where he helps customers get the most out of AWS. A 20-year IT veteran, Dario has also worked widely within the public sector, holding positions within the DOD, FBI, DHS, and DEA. Whether building highly available, scalable, and elastic architectures or complex enterprise systems with zero-downtime availability, Dario is always on the lookout for a challenge to change the world through customer success. Dario has presented at conferences and venues around the world, including AWS re:Invent, Strata + Hadoop World, HIMSS, and Oxford University.

Presentations

Building your first big data application on AWS Tutorial

Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ben Snively, Radhika Ravirala, Ryan Nienhuis, and Dario Rivera walk you through building a big data application using open source technologies, such as Apache Hadoop, Spark, and Zeppelin, and AWS managed services, such as Amazon EMR, Amazon Kinesis, and more.

Alex Rivlin leads the team responsible for dynamic threat intelligence at FireEye, which includes a crowdsourced malware exchange across FireEye customers and a malware analytics platform supporting FireEye’s research. For two decades, Alex has developed novel analytical capabilities for high-tech companies in projects spanning semiconductor failure analysis, supply chain optimization, pricing, and cybersecurity. Previously, Alex worked at Altera (currently part of Intel), where he was hired to develop an analytical platform for semiconductor test. His very first project earned a US patent for optimizing bulk loading of data into an RDBMS. He later joined a supply chain optimization project, where he was responsible for operational analytics; one of his developments allowed real-time reallocation of materials, a feature not available in any commercial package. Alex also spent time at Flextronics, where he was in charge of project management for the implementation of a global procurement solution and master data management.

Presentations

FireEye's journey migrating 25 TB of RDBMS data to Hadoop Session

Ganesh Prabhu, Alex Rivlin, and Vivek Agate share an approach that enabled a small team at FireEye to migrate 20 TB of RDBMS data comprising 250+ tables and nearly 2,000 partitions to Hadoop and an adaptive platform that allows migration of a rapidly changing dataset to Hive. Along the way, they explore some of the challenges typical for a company implementing Hadoop.

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.

Presentations

BI and SQL analytics with Hadoop in the cloud Session

Henry Robinson and Alex Gutow explain how to best take advantage of the flexibility and cost-effectiveness of the cloud with your BI and SQL analytic workloads, using Apache Hadoop and Apache Impala (incubating) to provide the same great functionality, partner ecosystem, and flexibility as on-premises deployments.

Alan Ross is a senior principal engineer and chief cloud security architect at Intel. Alan has more than 20 years of information security experience in various capacities, from policy and awareness and security/risk analysis to engineering and architecture. Previously, Alan worked as a security administrator and engineer for two global companies, focusing on network, host, and application security. He has 21 US patents and many others pending relating to security and manageability of systems and networks. Alan is currently leading activities around Open Network Insight, an open source project for advanced analytics of network telemetry.

Presentations

Paint the landscape and secure your data center with Apache Spot Session

Cesar Berho and Alan Ross offer an overview of open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture using unsupervised machine learning at cloud scale for anomaly detection.

Edgar Ruiz is a solutions engineer at RStudio with a background in deploying enterprise reporting and business intelligence solutions. He is the author of multiple articles and blog posts sharing insights on analytics and server infrastructure for data science. Recently, Edgar authored the “Data Science on Spark using sparklyr” cheat sheet.

Presentations

Sparklyr: An R interface for Apache Spark Session

Sparklyr makes it easy and practical to analyze big data with R—you can filter and aggregate Spark DataFrames to bring data into R for analysis and visualization and use R to orchestrate distributed machine learning in Spark using Spark ML and H2O Sparkling Water. Edgar Ruiz walks you through these features and demonstrates how to use sparklyr to create R functions that access the full Spark API.

Serdar Sahin leads big data and cloud initiatives at Peak Games, one of the world’s leading mobile gaming companies, with multiple top-10 grossing titles and tens of millions of daily active users. Serdar and his team build and scale Peak Games’s data platform, which provides self-service internal products and tools that process hundreds of terabytes of data, allowing other teams to take action in real time without any technical knowledge. Previously, Serdar was instrumental in a number of technology startups in Turkey and Australia. He holds a bachelor’s degree with distinction in information technologies from Central Queensland University.

Presentations

How Peak Games is building analytics infrastructure to improve user experience (sponsored by Snowflake) Session

Peak Games, a leading online and mobile gaming company, unites 30 million monthly unique players with free, culturally relevant, community-driven games. Serdar Sahin shares the company's journey evaluating MPP columnar databases against Hadoop to find the right data infrastructure to handle the unpredictable popularity of newly launched games.

Jiphun Satapathy is a senior security architect at Visa, where he leads the security architecture of Visa’s digital and mobile products, like Visa Token Service, Visa Checkout, and Visa Direct, which are used by millions of users. Jiphun’s areas of expertise include application security, data security, and cloud security. Previously, he was a software architect at Intel, where he led multiple teams to deliver products leveraging hardware security.

Presentations

End-to-end security for Kafka, Spark ML, and Hadoop Session

Apache Kafka is used by over 35% of Fortune 500 companies to store and process some of their most sensitive datasets. Ajit Gaddam and Jiphun Satapathy provide a security reference architecture to secure your Kafka cluster while leveraging it to support your organization's cybersecurity requirements.

Andrei Savu is a software engineer at Cloudera, where he’s working on Cloudera Director, a product that makes Hadoop deployments in cloud environments easier and more reliable for customers.

Presentations

A deep dive into leveraging cloud infrastructure for data engineering workloads Session

Cloud infrastructure, with a scalable data store and elastic compute, is particularly well suited for large-scale data engineering workloads. Andrei Savu and Jennifer Wu explore the latest cloud technologies and outline cost, security, and ease-of-use considerations for data engineers.

Deploying and operating big data analytic apps on the public cloud Tutorial

Jennifer Wu, Eugene Fratkin, Andrei Savu, and Tony Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

William Schmarzo is the CTO of Dell EMC, where he is responsible for setting the strategy and defining the offerings and capabilities for the EMC Consulting Enterprise Information Management and Analytics service line. Bill has more than two decades of experience in data warehousing, BI, and analytics applications. He authored the Business Benefits Analysis methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements and has served on the Data Warehouse Institute’s faculty as the head of the analytic applications curriculum. Previously, Bill was the vice president of analytics at Yahoo, where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of actionable insights through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing, and sales of their industry-defining analytic applications. Bill is the author of Big Data: Understanding How Data Powers Big Business (Wiley), has written several white papers, and coauthored a series of articles on analytic applications with Ralph Kimball. He is a frequent speaker on the use of big data and advanced analytics to power an organization’s key business initiatives. Bill holds a master’s degree in business administration from the University of Iowa and a bachelor of science degree in mathematics, computer science, and business administration from Coe College. You can find out more on the EMC website.

Presentations

Determining the economic value of your data Tutorial

Organizations need a model to measure how effectively they are using data and analytics. Once they know where they are and where they need to go, they then need a framework to determine the economic value of their data. William Schmarzo explores techniques for getting business users to “think like a data scientist” so they can assist in identifying data that makes the best performance predictors.

Office Hour with William Schmarzo (Dell EMC) Office Hour

If you need to determine the economic value of your data, Bill can help.

Robert Schroll is the data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.

Machine learning with TensorFlow (Day 2) Training Day 2

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies, Inc. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for six years.

Presentations

Cloudy with a chance of on-prem Tutorial

The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company or for your use cases is critical to ensuring success. Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations.

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Rama Sekhar is a principal at Norwest Venture Partners, where he focuses on early- to late-stage venture investments in enterprise and infrastructure, including cloud, big data, DevOps, cybersecurity, and networking. Rama’s current investments include Agari, Bitglass, and Qubole. Rama was previously an investor in Morta Security (acquired by Palo Alto Networks), Pertino Networks (acquired by Cradlepoint), and Exablox (acquired by StorageCraft). Before joining Norwest, Rama was with Comcast Ventures; a product manager at Cisco Systems, where he defined product strategy for the GSR 12000 Series and CRS-1 routers—$1B+ networking products in the carrier and data center markets; and a sales engineer at Cisco Systems, where he sold networking and security products to AT&T. Rama holds an MBA from the Wharton School of the University of Pennsylvania with a double major in finance and entrepreneurial management and a BS in electrical and computer engineering, with high honors, from Rutgers University.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Karthik Sethuraman is a senior software engineer at Trifacta, where, in addition to working on performance, Trifacta’s wrangle language, and core user experience, he helps build the inference layer that powers Trifacta’s predictive interaction. Previously, Karthik worked at Palantir and did research in computational biology.

Presentations

Intelligent pattern profiling on semistructured data with machine learning Session

It's well known that data analysts spend 80% of their time preparing data and only 20% analyzing it. In order to change that ratio, organizations must build tools specifically designed for working with ad hoc (semistructured) data. Sean Kandel and Karthik Sethuraman explore a new technique leveraging machine learning to discover and profile the inherent structure in ad hoc datasets.

Maya Shankar served as a senior advisor in the Obama White House for four years, where she founded and served as chair of the Social and Behavioral Sciences Team (SBST), a team of scientists charged with improving public policy using research insights about human behavior. In response to SBST’s impact, President Obama signed Executive Order 13707, “Using Behavioral Science Insights to Better Serve the American People,” which institutionalized SBST and codified the practice of applying behavioral science insights to federal policy. In 2016, Maya was asked to serve as the first behavioral science advisor to the United Nations. Previously, she completed a postdoctoral fellowship in cognitive neuroscience at Stanford. Maya holds a PhD from Oxford, earned while on a Rhodes Scholarship, and a BA from Yale in cognitive science. She is a graduate of the Juilliard School of Music precollege division and a former private violin student of Itzhak Perlman. Maya recently joined Google as its head of behavioral science.

Presentations

Improving public policy with behavioral science Keynote

Maya Shankar discusses the motivation for and impact of the White House Social and Behavioral Sciences Team and shares lessons learned building a startup within the federal government.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Gwen Shapira AMA

Join Confluent system architect Gwen Shapira to discuss Apache Kafka and its use cases, data streaming platforms, and microservices.

One cluster does not fit all: Architecture patterns for multicluster Apache Kafka deployments Session

There are many good reasons to run more than one Kafka cluster...and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions to help you better choose the right architecture for your needs.

Stream me up, Scotty: Transitioning to the cloud using a streaming data platform Session

Gwen Shapira and Bob Lehmann share their experience and patterns building a cross-data-center streaming data platform for Monsanto. Learn how to facilitate your move to the cloud while "keeping the lights on" for legacy applications. In addition to integrating private and cloud data centers, you'll discover how to establish a solid foundation for a transition from batch to stream processing.

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions. With previous experience in technology leadership positions for NetApp, Fujitsu, and others, Ben’s expertise ranges from development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization, and storage. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes, and he holds two patents.

Presentations

Building a modern data architecture (sponsored by Zaloni) Session

When building your data stack, architecture could be your biggest challenge—yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin when assembling a scalable data architecture? Ben Sharma shares real-world lessons and best practices to get you started.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among others, as well as KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Presentations

Ask me anything: Unraveling data with Spark using machine learning AMA

Join Vartika Singh, Jayant Shekhar, and Jeffrey Shmain to ask questions about their tutorial Unraveling Data with Spark Using Machine Learning or anything else Spark-related.

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches, based on the machine-learning algorithms available in the Spark framework (and beyond), to understanding and deciphering meaningful patterns in real-world data in order to derive value.
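
As a flavor of what the tutorial covers, here is a minimal Spark ML pipeline in Scala; it is only a sketch, with invented column names, an invented input file, and an assumed SparkSession named spark, not code from the tutorial itself:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Hypothetical input: numeric feature columns f1-f3 plus a string label column.
    val df = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("events.csv")

    // Index the string label into the numeric "label" column Spark ML expects.
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("label")

    // Combine the raw feature columns into a single "features" vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")

    val lr = new LogisticRegression().setMaxIter(10)

    // Chain the stages and fit the whole pipeline in one call.
    val model = new Pipeline().setStages(Array(indexer, assembler, lr)).fit(df)
    model.transform(df).select("category", "prediction").show(5)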

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of securities trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Ask me anything: Unraveling data with Spark using machine learning AMA

Join Vartika Singh, Jayant Shekhar, and Jeffrey Shmain to ask questions about their tutorial Unraveling Data with Spark Using Machine Learning or anything else Spark-related.

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches, based on the machine-learning algorithms available in the Spark framework (and beyond), to understanding and deciphering meaningful patterns in real-world data in order to derive value.

Evangelos Simoudis is the cofounder and managing director of Synapse Partners, a Silicon Valley-based venture firm, where he invests in early-stage startups working on big data applications and advises global corporations on startup-driven innovation and big data strategies. Evangelos began his investing career as a partner at Apax Partners and continued as a senior managing director at Trident Capital. His current directorships include Amobee/Singtel and Kite. Prior directorships include Brightroll (acquired by Yahoo), Bristol Technology (acquired by Hewlett-Packard), Composite Software (acquired by Cisco), Confluent Software (acquired by Oracle Corporation), Exelate (acquired by Nielsen), and Princeton Softech (acquired by IBM). Prior to his investing and advisory career, Evangelos had a 20-year career in high-tech executive roles, including two startup CEO positions.

A recognized thought leader on big data, corporate innovation, cloud computing, and digital marketing platforms, Evangelos is the author of The Big Data Opportunity in Our Driverless Future. In 2014, he was named a Power Player in Digital Media, and in 2012, he was recognized as a top investor in online advertising. He is a member of Caltech’s Information Science and Technology advisory board, the advisory board of Brandeis International School of Business, the advisory board of New York’s Center for Urban Science and Progress, and the advisory board of Securing America’s Future Energy. Evangelos holds a PhD in computer science from Brandeis University in the area of machine learning and large databases and a BS in electrical engineering from Caltech.

Presentations

Big data opportunities in next-generation mobility Tutorial

Evangelos Simoudis explores how data generated in and around increasingly autonomous vehicles and by on-demand mobility services will enable the development of new transportation experiences and solutions for a diverse set of industries and governments.

Jiri Simsa is a software engineer at Alluxio and one of the maintainers and top contributors of the Alluxio open source project. Previously, he was a software engineer at Google, where he worked on the distributed framework for the IoT. Jiri holds a PhD in computer science from Carnegie Mellon University, where his work focused on systematic and scalable testing of concurrent systems.

Presentations

Effective Spark with Alluxio Session

Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. Gene Pang and Jiri Simsa introduce Alluxio, explain how Alluxio can help Spark be more effective, show benchmark results with Spark RDDs and DataFrames, and describe production deployments with both Alluxio and Spark working together.
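
For a flavor of the integration: once an Alluxio cluster is running, Spark can read and write through it like any Hadoop-compatible filesystem, simply by using an alluxio:// URI. The Scala sketch below assumes a SparkSession named spark and a placeholder master address; the paths are invented:

    // Read a text dataset through Alluxio (19998 is Alluxio's default master port).
    val logs = spark.read.textFile("alluxio://alluxio-master:19998/data/logs")

    // Work with it as a normal Dataset; hot data can be served from Alluxio memory.
    val errors = logs.filter(_.contains("ERROR")).cache()
    println(s"error lines: ${errors.count()}")

    // Write results back through the same interface.
    errors.write.text("alluxio://alluxio-master:19998/data/error-lines")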

Vartika Singh is a solutions architect at Cloudera with over 10 years of experience applying machine-learning techniques to big data problems.

Presentations

Ask me anything: Unraveling data with Spark using machine learning AMA

Join Vartika Singh, Jayant Shekhar, and Jeffrey Shmain to ask questions about their tutorial Unraveling Data with Spark Using Machine Learning or anything else Spark-related.

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches, based on the machine-learning algorithms available in the Spark framework (and beyond), to understanding and deciphering meaningful patterns in real-world data in order to derive value.

Mehmet Irmak Sirer is a partner and data scientist at Datascope Analytics, where he has helped companies across industries solve problems with data, from small companies to members of the Fortune 50. Irmak has conducted and published academic research on a wide range of topics, including student choices in public schools, the web browsing behavior of the masses, global airline networks, species conservation in ecology, language topic models, and optimizing DNA sequences for high gene expression, among others. Technically, Irmak is a materials scientist (with both a BS and an MS from Sabanci University, Turkey), a chemical and biological engineer (with both an MS and a PhD from Northwestern University), and an art historian (his minor at Sabanci University). Practically, he believes in merging knowledge from different disciplines to ask and answer the right questions. When he is not striving for this, he believes in movies, bourbon, and Elliott Smith.

Presentations

Four key skills for vice presidents, directors, and managers in data-driven organizations Session

In a data-driven organization, vice presidents, directors, and managers play a crucial role as translators between senior leadership and data science teams. They don’t need to be full-fledged data scientists, but they do need data science “street smarts” in order to succeed in this critical task. Mehmet Irmak Sirer outlines the skills they need and gives practical ways to improve them.

Ram Shankar is a security data wrangler in Azure Security Data Science, where he works at the intersection of ML and security. Ram’s work at Microsoft includes a slew of patents in the large-scale intrusion detection space (called “fundamental and groundbreaking” by evaluators). In addition, he has given talks at internal conferences and received Microsoft’s Engineering Excellence award. Ram has previously spoken at data-analytics-focused conferences like Strata San Jose and the Practice of Machine Learning as well as at security-focused conferences like BlueHat, DerbyCon, FireEye Security Summit (MIRCon), and Infiltrate. Ram graduated from Carnegie Mellon University with master’s degrees in both ECE and innovation management.

Presentations

Operationalizing security data science for the cloud: Challenges, solutions, and trade-offs Session

Ram Shankar Siva Kumar and Andrew Wicker explain how to operationalize security analytics for production in the cloud, covering a framework for assessing the impact of compliance on model design, six strategies and their trade-offs to generate labeled attack data for model evaluation, key metrics for measuring security analytics efficacy, and tips to scale anomaly detection systems in the cloud.

Kaarthik Sivashanmugam is a principal software engineer on the Shared Data platform team at Microsoft. Kaarthik is the tech lead for the Mobius project specializing in Spark Streaming. Prior to joining the Shared Data platform team, he was on the Bing Ads team, where he built a near real-time analytics platform using Kafka, Storm, and Elasticsearch and used it to implement data processing pipelines. Previously, at Microsoft, Kaarthik was involved in the development of Data Quality Services in Azure and also contributed to multiple releases of SQL Server Integration Services as a hands-on engineering manager. Before joining Microsoft, Kaarthik was a senior software engineer in a semantic technology startup, where he built an ontology-based semantic metadata platform and used it to implement solutions for KYC/AML analytics.

Presentations

Spark at scale in Bing: Use cases and lessons learned Session

Spark powers various services in Bing, but the Bing team had to customize and extend Spark to cover its use cases and scale the implementation of Spark-based data pipelines to handle internet-scale data volume. Kaarthik Sivashanmugam explores these use cases, covering the architecture of Spark-based data platforms, challenges faced, and the customization done to Spark to address the challenges.

Crystal Skelton is an associate in Kelley Drye & Warren’s Los Angeles office, where she represents a wide array of clients, from tech startups to established companies, in privacy and data security, advertising and marketing, and consumer protection matters. Crystal advises clients on privacy, data security, and other consumer protection matters, specifically focusing on issues involving children’s privacy, mobile apps, data breach notification, and other emerging technologies, and counsels clients on complying with the FTC Act, the Children’s Online Privacy Protection Act (COPPA), the Gramm-Leach-Bliley Act, the GLB Safeguards Rule, the Fair Credit Reporting Act (FCRA), the Fair and Accurate Credit Transactions Act (FACTA), and state privacy and information security laws. She regularly drafts privacy policies and terms of use for websites, mobile applications, and other connected devices.

Crystal also helps advertisers and manufacturers balance legal risks and business objectives to minimize the potential for regulator, competitor, or consumer challenge while still executing a successful campaign. Her advertising and marketing experience includes counseling clients on issues involved in environmental marketing, marketing to children, online behavioral advertising (OBA), commercial email messages, endorsements and testimonials, food marketing, and alcoholic beverage advertising. She represents clients in advertising substantiation proceedings and other matters before the Federal Trade Commission (FTC), the US Food and Drug Administration (FDA), and the Alcohol and Tobacco Tax and Trade Bureau (TTB) as well as in advertiser or competitor challenges before the National Advertising Division (NAD) of the Council of Better Business Bureaus. In addition, she assists clients in complying with accessibility standards and regulations implementing the Americans with Disabilities Act (ADA), including counseling companies on website accessibility and advertising and technical compliance issues for commercial and residential products. Prior to joining Kelley Drye, Crystal practiced privacy, advertising, and transactional law at a highly regarded firm in Washington, DC, and served as a law clerk at a well-respected complex commercial and environmental litigation law firm in Los Angeles, CA. Previously, she worked at the law firm featured in the movie Erin Brockovich, collaborating directly with Erin Brockovich and the firm’s name partner to review potential new cases.

Presentations

Executive Briefing: Doing data right—Legal best practices for making your data work Session

Big data promises enormous benefits for companies, and new innovations in this space only mean more data collection is required. Having a solid understanding of legal obligations will help you avoid the legal snafus that can come with collecting big data. Alysa Hutnik and Crystal Skelton outline legal best practices and practical tips to avoid becoming a big data “don’t.”

Jason Slepicka is a senior data engineer at DataScience, where he specializes in conducting database research aimed at building query optimizers for big data systems. His research drives improvements in the performance of Spark while querying relational databases and enables other query languages to be translated into Spark. Jason is also an instructor for DataScience’s elite residency education program, DS12, where he teaches students about data engineering, machine-learning algorithms, and Spark. Jason is pursuing a PhD in computer science at the University of Southern California Information Sciences Institute. His work uses information integration and semantic web techniques to build knowledge graphs for partners (including DARPA) to fight human trafficking, firearms trafficking, and patent trolls.

Presentations

Advanced data federation and cost-based optimization using Apache Calcite and Spark SQL (sponsored by DataScience) Session

Apache Spark has become the go-to system for servicing ad hoc queries, but the Catalyst optimizer still lacks many of the pushdown optimizations necessary to take advantage of native database features. Jason Slepicka explains how DataScience replaced Catalyst with Apache Calcite to achieve performance improvements of two orders of magnitude when querying SQL and NoSQL databases with Spark.

Ben Snively is a specialist solutions architect on the Amazon Web Services Public Sector team, where he specializes in big data, analytics, and search. Previously, Ben was an engineer and architect on DOD contracts, where he worked with Hadoop and big data solutions. He has over 11 years of experience creating analytical systems. Ben holds both a bachelor’s and a master’s degree in computer science from the Georgia Institute of Technology and a master’s in computer engineering from the University of Central Florida.

Presentations

Building your first big data application on AWS Tutorial

Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ben Snively, Radhika Ravirala, Ryan Nienhuis, and Dario Rivera walk you through building a big data application using open source technologies, such as Apache Hadoop, Spark, and Zeppelin, and AWS managed services, such as Amazon EMR, Amazon Kinesis, and more.

Serverless big data architectures: Design patterns and best practices (sponsored by AWS) Session

Siva Raghupathy and Ben Snively explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, they explain when and how you can use serverless technologies to streamline data processing and share a reference architecture using a combination of cloud and open source technologies.

Emily Spahn is a data scientist at ProKarma. Emily enjoys leveraging data to solve a wide range of problems. Previously, she worked as a civil engineer with a focus on hydraulic and hydrologic modeling and has worked in government, private industry, and academia over her career. Emily holds degrees in physics and environmental engineering.

Presentations

Saving lives with data: Identifying patients at risk of decline Session

Many hospitals combine early warning systems with rapid response teams (RRT) to detect patient decline and respond with elevated care. Predictive models can minimize RRT events by identifying at-risk patients, but modeling is difficult because events are rare and features are varied. Emily Spahn explores the creation of one such patient-risk model and shares lessons learned along the way.

Raghotham Sripadraj is cofounder and data scientist at Unnati Data Labs, where he is building end-to-end data science systems in the fields of fintech, marketing analytics, and event management. Raghotham is also a mentor for data science on Springboard. Previously, at Touchpoints Inc., he single-handedly built a data analytics platform for a fitness wearable company; at Redmart, he worked on the CRM system and built a sentiment analyzer for Redmart’s social media; and at SAP Labs, he was a core part of what is currently SAP’s framework for building web and mobile products, as well as a part of multiple company-wide events helping to spread knowledge both internally and to customers. Drawing on his deep love for data science and neural networks and his passion for teaching, Raghotham has conducted workshops across the world and given talks at a number of data science conferences. Apart from getting his hands dirty with data, he loves traveling, Pink Floyd, and masala dosas.

Presentations

Making architecture choices for small and big data problems Session

Not all data science problems are big data problems. Lots of small and medium product companies want to start their journey to become data driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed and fault-tolerant tools.

Julie Steele thinks in metaphors and finds beauty in the clear communication of ideas. She is particularly drawn to visual media as a way to understand and transmit information. Julie is coauthor of Beautiful Visualization (O’Reilly, 2010) and Designing Data Visualizations (O’Reilly, 2012).

Presentations

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Julie Steele, Stephen O'Sullivan, and Scott Kurth field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for and the evolving role of the CDO. Even if you don’t have a specific question, join in to hear what others are asking.

Abdul Subhan is the principal solutions architect for Verizon’s 4G Data Analytics team, where he is responsible for developing, deploying, and managing a broad range of mission-critical reporting and analytics applications. Previously, Abdul spent close to a decade as an internal technology consultant to various parts of Verizon’s wireless, wireline, broadband, data center, and video businesses and managed data and technology at Nuance Communications and Alcatel Lucent. Abdul holds a master’s degree in computer engineering from King Fahd University of Petroleum & Minerals and an undergraduate degree in telecommunications from Visvesvaraya Technological University.

Presentations

From hours to milliseconds: How Verizon accelerated its mobile analytics Session

With more than 91M customers, Verizon produces oceans of data. The challenge this onslaught presents isn’t one of storage—it’s one of speed. The solution? Harnessing the power of GPUs to access insights in less than a millisecond. Todd Mostak and Abdul Subhan explain how Verizon solved its data challenge by implementing GPU-tuned analytics and visualization.

Sean Suchter is the CTO and cofounder of Pepperdata. Previously, Sean was the founding GM of Microsoft’s Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search, and managed the Yahoo Search Technology team, the first production user of Hadoop. He joined Yahoo through the acquisition of Inktomi. Sean holds a BS in engineering and applied science from Caltech.

Presentations

Big data for big data: Machine-learning models of Hadoop cluster behavior Session

Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.

Brian Suda is a master informatician currently residing in Reykjavík, Iceland. Since first logging on in the mid-’90s, he has spent a good portion of each day connected to the internet. When he is not hacking on microformats or writing about web technologies, he enjoys taking kite aerial photography. His own little patch of internet can be found at Suda.co.uk, where many of his past projects, publications, interviews, and crazy ideas can be found.

Presentations

Introduction to visualizations using D3 Tutorial

Visualizations are a key part of conveying any dataset. D3 is the most popular, easiest, and most extensible way to get your data online in an interactive way. Brian Suda outlines best practices for good data visualizations and explains how you can build them using D3.

Jagane Sundar is the CTO at WANdisco. Jagane has extensive big data, cloud, virtualization, and networking experience. He joined WANdisco through its acquisition of AltoStor, a Hadoop-as-a-service platform company. Previously, Jagane was founder and CEO of AltoScale, a Hadoop- and HBase-as-a-platform company acquired by VertiCloud. His experience with Hadoop began as director of Hadoop performance and operability at Yahoo. Jagane’s accomplishments include creating Livebackup, an open source project for KVM VM backup, developing a user mode TCP stack for Precision I/O, developing the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun Microsystems, and creating and selling a 32-bit VxD-based TCP stack for Windows 3.1 to NCD Corporation for inclusion in PC-Xware. Jagane is currently a member of the technical advisory board of VertiCloud. Jagane holds a BE in electronics and communications engineering from Anna University.

Presentations

Replication as a service (sponsored by WANdisco) Session

Jagane Sundar shares a strongly consistent replication service for replicating between cloud object stores, HDFS, NFS, and other S3- and Hadoop-compatible filesystems.

Arvind Surve is a software developer in IBM’s Spark Technology Center in San Francisco. Arvind is a SystemML contributor and committer. He has worked at IBM for 17+ years. Arvind has presented at the 2015 Data Engineering Conference in Tokyo and to the Chicago Spark user group. He holds an MS in digital electronics and communication systems and an MBA in finance and marketing.

Presentations

Compressed linear algebra in Apache SystemML Session

Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in the main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.
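
To illustrate the core idea (a toy sketch only, not SystemML's actual compression schemes), the Scala snippet below stores a column of training data run-length encoded and computes a dot product directly in the compressed domain, skipping zero runs entirely:

    // A column stored as (value, runLength) pairs instead of one entry per row.
    case class Run(value: Double, length: Int)

    def dotCompressed(runs: Seq[Run], dense: Array[Double]): Double = {
      var offset = 0
      var sum = 0.0
      for (Run(v, len) <- runs) {
        // Operate per run, never materializing the uncompressed column;
        // runs of zeros cost nothing.
        if (v != 0.0) sum += v * dense.slice(offset, offset + len).sum
        offset += len
      }
      sum
    }

    // The column [0, 0, 0, 1, 1, 2] encoded as three runs:
    val col = Seq(Run(0.0, 3), Run(1.0, 2), Run(2.0, 1))
    println(dotCompressed(col, Array.fill(6)(1.0))) // prints 4.0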

Mahesh Goud is a data scientist on Ticketmaster’s Data Science team, where he focuses on the automation and optimization of its paid customer acquisition systems and works on quantifying and modeling the various data sources used by the Paid Acquisition Optimization Engine to increase the efficacy of marketing spend. He has helped develop and test the platform since its conception. Previously, he was a software engineer at Citigroup, where he was involved in the development of a real-time stock pricing engine. Mahesh holds a master’s degree in computer science specializing in data science from the University of Southern California and a bachelor’s degree with honors in computer science specializing in computer vision from the International Institute of Information Technology, Hyderabad.

Presentations

A contextual real-time bidding engine for search engine marketing Session

Mahesh Goud shares success stories using Ticketmaster's large-scale contextual bandit platform for SEM, which determines the optimal keyword bids under evolving keyword contexts to meet different business requirements, and explores Ticketmaster's streaming pipeline, consisting of Storm, Kafka, HBase, the ELK Stack, and Spring Boot.

Shubham Tagra is a member of the technical staff at Qubole, where he works on Presto and Hive development and on making these solutions cloud ready. Previously, Shubham worked at NetApp on its storage area network. Shubham holds a bachelor’s degree in computer engineering from the National Institute of Technology, Karnataka, India.

Presentations

RubiX: A caching framework for big data engines in the cloud Session

Shubham Tagra offers an introduction to RubiX, a lightweight, cross-engine caching solution that works well with optimized columnar formats by caching only the required amount of data. RubiX can be used with any data analytics engine that reads data from remote sources via the Hadoop FileSystem interface without any changes to the source code of those engines.

David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies Session

David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.

Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today Daniel works on the YARN development team at Cloudera, focused on the resource manager, fair scheduler, and Docker support.

Presentations

Docker on YARN Session

Docker makes it easy to bundle an application with its dependencies and provide full isolation, and YARN now supports Docker as an execution engine for submitted applications. Daniel Templeton explains how YARN's Docker support works, why you'd want to use it, and when you shouldn't.

Jasjeet Thind is the vice president of data science and engineering at Zillow. His group focuses on machine-learned prediction models and big data systems that power use cases such as Zestimates, personalization, housing indices, search, content recommendations, and user segmentation. Prior to Zillow, Jasjeet served as director of engineering at Yahoo, where he architected a machine-learned real-time big data platform that leverages social signals for user interest modeling and content prediction. The system powers personalized content on Yahoo, Yahoo Sports, and Yahoo News. Jasjeet holds BS and master’s degrees in computer science from Cornell University.

Presentations

Zillow: Transforming real estate through big data and machine learning Session

Zillow pioneered providing access to unprecedented information about the housing market. Long gone are the days when you needed an agent to get comparables and prior sale and listing data. And with more data, data science has enabled more use cases. Jasjeet Thind explains how Zillow uses Spark and machine learning to transform real estate.

Rocky Tiwari is the manager of innovation and architecture at Transamerica.

Presentations

Transamerica's journey to Customer 360 and beyond Session

Vishal Bamba and Rocky Tiwari offer an overview of Transamerica's Customer 360 platform and the subsequent work to build on this technology, including the use of graph databases and machine learning to create targeted segments for products and campaigns.

Wee Hyong Tok is a principal data science manager at Microsoft, where he works with teams to cocreate new value and turn each of the challenges facing organizations into compelling data stories that can be concretely realized using proven enterprise architecture. Wee Hyong has worn many hats in his career, including developer, program/product manager, data scientist, researcher, and strategist, and his range of experience has given him unique super powers to nurture and grow high-performing innovation teams that enable organizations to embark on their data-driven digital transformations using artificial intelligence. He has a passion for leading artificial intelligence-driven innovations and working with teams to envision how these innovations can create new competitive advantage and value for their business and strongly believes in story-driven innovation.

Presentations

Using big data, the cloud, and AI to enable intelligence at scale (sponsored by Microsoft) Session

Wee Hyong Tok and Danielle Dean explain how the global, trusted, and hybrid Microsoft platform can enable intelligence at scale, describing real-life applications where big data, the cloud, and AI are making a difference and how this is accelerating digital transformation for these organizations at a lightning pace.

Carlo Torniai is head of data science and analytics at Pirelli. Previously, he was a staff data scientist at Tesla Motors. He received his PhD in informatics from the Università degli Studi di Firenze, Italy.

Presentations

How a global manufacturing company built a data science capability from scratch Tutorial

Building a cross-functional data science team at a large, multinational manufacturing company presents a number of cultural, organizational, technical, and operational challenges. Carlo Torniai explains how Pirelli grew an organization that was able to deliver key insights in less than a year and shares advice for both new and established data science teams.

Steven Totman is Cloudera’s big data subject-matter expert, helping companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Steve works with over 180 customers worldwide across verticals, advising on architectures, data management tools, data models, and ethical data usage. Previously, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining with the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents in data integration and governance- and metadata-related designs. Although he is based in NYC, Steve is happiest onsite with customers wherever they may be in the world.

Presentations

Big data as a force for good Session

In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved.

Ken Tsai is the vice president and head of data management and PaaS at SAP, where he leads the product marketing efforts for SAP’s in-memory computing platform SAP HANA, HANA Cloud Platform, and the portfolio of SAP data management solutions such as HANA Vora, ASE, IQ, SQL Anywhere, and event stream processing. Ken has 20+ years of experience in the IT industry spanning development, implementation, presales, business development, and product marketing. Ken is a graduate of UC Berkeley.

Presentations

Modernizing business processes with big data: Real-world use cases for production (sponsored by SAP) Session

Ken Tsai and Michael Eacrett explore critical components of enterprise production environments that support day-to-day business processes while ensuring security, governance, and operational administration and share best practices to ensure business value.

Office Hour with Ken Tsai and Michael Eacrett (SAP) Office Hour

If you need your enterprise production environment to not only support day-to-day business processes but also ensure security, governance, and operational administration, Ken and Michael can offer tips, tricks, and best practices and answer your questions.

Teresa Tung is a technology fellow at Accenture Technology Labs, where she is responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s industry platforms like Accenture Digital Connected Products and Accenture Analytics Insights Platform. Teresa holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

DevOps for models: How to manage millions of models in production Session

As Accenture scaled to millions of predictive models, it needed automation to manage models at scale, ensure accuracy, prevent false alarms, and preserve trust as models are created, tested, and deployed into production. Teresa Tung, Jürgen Weichenberger, and Ishmeet Grewal share their approach to implementing DevOps for models and employing a self-healing approach to model lifecycle management.

Executive Briefing: IoT and unconventional data Session

The IoT is driven by outcomes delivered by applications, but to gain operational efficiency, many organizations are looking toward a horizontal platform for delivering and supporting a number of applications. Teresa Tung explores how to choose and implement a platform—and deal with the fact that the platform is horizontal and application outcomes are vertical.

Alexander Ulanov is a senior researcher at Hewlett Packard Labs, where he focuses his research on machine learning on a large scale. Currently, Alexander works on deep learning and graphical models. He has made several contributions to Apache Spark; in particular, he implemented the multilayer perceptron classifier. Previously, he worked on text mining, classification and recommender systems, and their real-world applications. Alexander holds a PhD in mathematical modeling from the Russian Academy of Sciences.

Presentations

Malicious site detection with large-scale belief propagation Session

Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP) for Apache Spark, applying BP to large web-crawl data to infer the probability of websites to be malicious. Applications of BP include fraud detection, malware detection, computer vision, and customer retention.
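
As a rough intuition for how such inference spreads through a graph (a deliberately simplified sketch, not the authors' Spark implementation), the Scala snippet below iteratively nudges each site's "maliciousness" score toward the average of its neighbors; real loopy BP propagates full message distributions rather than single scores:

    // Tiny web graph: hyperlinks between sites, treated as undirected.
    val edges = Seq("a" -> "b", "b" -> "c", "c" -> "a", "d" -> "a")
    val neighbors = (edges ++ edges.map(_.swap))
      .groupBy(_._1).mapValues(_.map(_._2)).toMap

    // Prior beliefs: site "d" is a known bad actor.
    var score = Map("a" -> 0.1, "b" -> 0.1, "c" -> 0.1, "d" -> 0.9)

    for (_ <- 1 to 10) {
      score = score.map { case (site, s) =>
        val ns = neighbors.getOrElse(site, Seq.empty)
        val avg = if (ns.isEmpty) s else ns.map(score).sum / ns.size
        site -> (0.5 * s + 0.5 * avg) // damped update toward the neighborhood
      }
    }
    score.toSeq.sortBy(-_._2).foreach { case (site, s) => println(f"$site: $s%.3f") }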

Amy Unruh is a developer programs engineer for the Google Cloud Platform, with a focus on machine learning and data analytics as well as other Cloud Platform technologies. Amy has an academic background in CS/AI and has also worked at several startups, done industrial R&D, and published a book on App Engine.

Presentations

Getting started with TensorFlow Tutorial

Amy Unruh and Yufeng Guo walk you through training and deploying a machine-learning system using TensorFlow, a popular open source library. Amy and Yufeng begin by giving an overview of TensorFlow and demonstrating some fun, already-trained TensorFlow models.

Manjunath Vasishta is the senior manager of the Data Science and Engineering group at Malwarebytes. Previously, Manju spent more than 12 years at Infosys, a global leader in technology services and consulting, where he led critical projects building and managing big data, data science, business intelligence, and data warehousing solutions for Fortune 50 clients. In an industry where the pace of change is itself accelerating, he has learned the art of “dancing in a hurricane” by reading technology books and listening to music.

Presentations

Building an automation-driven Lambda architecture (sponsored by BMC) Session

Darren Chinen, Sujay Kulkarni, and Manjunath Vasishta demonstrate how to use a Lambda architecture to provide real-time views into big data by combining batch and stream processing, leveraging BMC’s Control-M as a critical component of both batch processing and ecosystem management.

Ashish Verma is a managing director at Deloitte, where he leads the Big Data and IoT Analytics practice, building offerings and accelerators to enhance business processes and effectiveness. Ashish has more than 18 years of management consulting experience helping Fortune 100 companies build solutions that focus on addressing complex business problems related to realizing the value of information assets within an enterprise.

Presentations

Executive Briefing: From data insights to action—Developing a data-driven company culture Session

Ashish Verma explores the challenges organizations face after investing in hardware and software to power their analytics projects and the missteps that lead to inadequate data practices. Ashish explains how to course-correct and implement an insight-driven organization (IDO) framework that enables you to derive tangible value from your data faster.

Naghman Waheed leads the Data Platforms team at Monsanto and is responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order-to-cash, finance, and procurement. Throughout his 20+-year career at Monsanto, Naghman has held a variety of positions in the data space, ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master’s degree in information management, both from Washington University.

Presentations

The enterprise geospatial platform: A perfect fusion of cloud and open source technologies Session

Recently, the volume of data collected from farmers' fields via sensors, rovers, drones, in-cabin technologies, and other sources has forced Monsanto to rethink its geospatial processing capabilities. Naghman Waheed and Martin Mendez-Costabel explain how Monsanto built a scalable geospatial platform using cloud and open source technologies.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a streaming data platform built on the Lightbend Reactive Platform, Kafka, Spark, Flink, and Mesosphere DC/OS. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He contributes to several open source projects and co-organizes several conferences around the world as well as several user groups in Chicago.

Presentations

Just enough Scala for Spark Tutorial

Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.
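
For readers wondering what "just enough" might look like, a handful of Scala features cover most everyday Spark code. The hedged sketch below (assuming a SparkSession named spark) shows three of them, case classes, anonymous functions, and type inference, in a typical Dataset transformation:

    // Case classes define Dataset schemas.
    case class Sale(product: String, amount: Double)

    import spark.implicits._ // provides .toDS() and encoders for case classes

    val sales = Seq(Sale("book", 12.0), Sale("pen", 2.5), Sale("book", 8.0)).toDS()

    val totals = sales
      .filter(s => s.amount > 1.0)  // anonymous function
      .groupByKey(_.product)        // underscore shorthand for a lambda
      .mapGroups((product, rows) => product -> rows.map(_.amount).sum)

    totals.collect().foreach(println) // (book,20.0) and (pen,2.5)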

Office Hour with Dean Wampler (Lightbend) Office Hour

Want to use Scala effectively? Or maybe you took Dean's tutorial but still have a question. Dean is an amazing resource to turn to with Scala questions.

Melanie Warrick is a senior developer advocate at Google with a passion for machine-learning problems at scale. Previously, she was a founding engineer on Deeplearning4J and a data scientist and engineer at Change.org.

Presentations

What is AI? Tutorial

Melanie Warrick explores the definition of artificial intelligence and seeks to clarify what AI will mean for our world. Melanie summarizes AI’s most important effects to date and demystifies the changes we’ll see in the immediate future, separating myth from realistic expectation.

Jason Waxman is a corporate vice president at Intel in the Data Center Group and general manager of the Data Center Solutions Group. DCG’s objective is to develop and deliver innovative solutions that offer high customer value, are easy to deploy, and are cost efficient. Jason has held several roles in enterprise and data center computing at Intel. Before leading the Cloud Platforms Group, he served as general manager of high-density computing and as marketing director of Intel Xeon platforms, leading the definition and introduction of enterprise platforms. Previously, he worked in strategic planning and manufacturing for Emerson Electric and as a management consultant. Jason has been an industry advocate for standards in data center computing and has held board roles in the Open Compute Foundation and the Server System Infrastructure Forum. He also initiated Intel’s role as technical advisor to the Open Data Center Alliance. Jason holds a bachelor’s degree in mechanical engineering, a master’s degree in operations research, and an MBA, all from Cornell University.

Presentations

Collaboration in AI benefits humanity (sponsored by Intel) Keynote

Artificial intelligence will accelerate both cancer research and the development of autonomous vehicles. Jason Waxman explains why the ultimate potential of AI will be realized through its societal benefits and positive impact on our world. Collaboration between industry, government, and academia is required to drive this societal innovation and deliver the scale and promise of AI to everyone.

Randy Wei is a software engineer at Uber. He holds a bachelor’s degree in computer science from the University of California, Berkeley.

Presentations

Uber's data science workbench Session

Peng Du and Randy Wei offer an overview of Uber’s data science workbench, which provides a central platform for data scientists to perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is seamlessly integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring.

Jürgen Weichenberger is a data science senior principal at Accenture Analytics, where he is currently working within resources industries with interests in smart grids and power, digital plant engineering, and optimization for upstream industries and the water industry. Jürgen has over 15 years of experience in engineering consulting, data science, big data, and digital change. In his spare time, he enjoys spending time with his family and playing golf and tennis. Jürgen holds a master’s degree (with first-class honors) in applied computer science and bioinformatics from the University of Salzburg.

Presentations

DevOps for models: How to manage millions of models in production Session

As Accenture scaled to millions of predictive models, it needed automation to manage models at scale, ensure accuracy, prevent false alarms, and preserve trust as models are created, tested, and deployed into production. Teresa Tung, Jürgen Weichenberger, and Ishmeet Grewal share their approach to implementing DevOps for models and employing a self-healing approach to model lifecycle management.

Jay White Bear is a data scientist and advisory software engineer at IBM. Jay holds a degree in computer science from the University of Michigan, where her work focused on databases, machine learning, computational biology, and cryptography. Jay has also done work on multiobjective optimization, computational biology, and bioinformatics at the University of California, San Francisco and machine learning, multiobjective optimization for path planning, and cryptography at McGill University.

Presentations

The IoT and the autonomous vehicle in the clouds: Simultaneous localization and mapping (SLAM) with Kafka and Spark Streaming Tutorial

The simultaneous localization and mapping (SLAM) problem is the cutting edge of robotics for autonomous vehicles and a key challenge in both industry and research. Jay White Bear shares a new integrated framework that demonstrates a constrained SLAM using online algorithms to navigate and map in real time using the Turtlebot II.

Andrew Wicker is a machine learning engineer in the Security division at Microsoft, where his current work focuses on researching and developing machine-learning solutions to protect identities in the cloud. Andrew’s previous work includes developing machine-learning models to detect safety events in an immense amount of FAA radar data and working on the development of a distributed graph analytics system. His expertise encompasses the areas of artificial intelligence, graph analysis, and large-scale machine learning. Andrew holds a BS, an MS, and a PhD in computer science from North Carolina State University.

Presentations

Operationalizing security data science for the cloud: Challenges, solutions, and trade-offs Session

Ram Shankar Siva Kumar and Andrew Wicker explain how to operationalize security analytics for production in the cloud, covering a framework for assessing the impact of compliance on model design, six strategies and their trade-offs to generate labeled attack data for model evaluation, key metrics for measuring security analytics efficacy, and tips to scale anomaly detection systems in the cloud.

Edd Wilder-James is a technology analyst, writer, and entrepreneur based in California. He’s helping transform businesses with data as VP of strategy for Silicon Valley Data Science. Formerly Edd Dumbill, Edd was the founding program chair for the O’Reilly Strata conferences and chaired the Open Source Convention for six years. He was also the founding editor of the peer-reviewed journal Big Data. A startup veteran, Edd was the founder and creator of the Expectnation conference-management system and a cofounder of the Pharmalicensing.com online intellectual-property exchange. An advocate and contributor to open source software, Edd has contributed to various projects such as Debian and GNOME and created the DOAP vocabulary for describing software projects. Edd has written four books, including O’Reilly’s Learning Rails.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.

The business case for deep learning, Spark, and friends Tutorial

Deep learning is white-hot at the moment, but why does it matter? Developers are usually the first to understand why some technologies cause more excitement than others. Edd Wilder-James relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Cack Wilhelm is a principal at Scale Venture Partners, where she focuses on investments in early-stage software companies, with an eye toward those helping businesses better utilize data, automate workflows, incorporate AI, and build more resilient software. Looking further ahead, Cack is watching closely as platforms such as virtual reality and augmented reality take shape. Cack cut her teeth selling 11g databases at Oracle and Hadoop clusters at Cloudera in the months before Hadoop reached Version 1.0. Cack has since transferred that operational and go-to-market experience into helping Scale portfolio companies such as Treasure Data, Realm, and CircleCI. Cack was initially drawn to the technology and software sectors while at Montgomery & Company, where she helped advise on software acquisitions and capital raises. Cack holds an MBA from the University of Chicago Booth School of Business and a BA from Princeton University. She is also a dedicated runner, racing professionally for Nike for two years at the 5,000m distance and competing in 12 of 12 athletic seasons at Princeton (cross-country, indoor track, and outdoor track). While at Princeton, she was named the Princeton Female Athlete of the Year and was a seven-time All-America recipient.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Ian Wrigley has taught tens of thousands of students over the last 25 years in subjects ranging from C programming to Hadoop development and administration. Ian is currently the director of education services at Confluent, where he heads the team building and delivering courses focused on Apache Kafka and its ecosystem.

Presentations

Building real-time data pipelines with Apache Kafka Tutorial

Ian Wrigley demonstrates how Kafka Connect and Kafka Streams can be used together to build real-world, real-time streaming data pipelines. Using Kafka Connect, you'll ingest data from a relational database into Kafka topics as the data is being generated and then process and enrich the data in real time using Kafka Streams before writing it out for further analysis.
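
As a taste of the Streams half of such a pipeline, here is a minimal Scala sketch against the Kafka Streams Java API; the topic names and enrichment logic are placeholders, not the tutorial's actual code:

    import java.util.Properties
    import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
    import org.apache.kafka.streams.kstream.{KStream, ValueMapper}

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.Serdes$StringSerde")
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.Serdes$StringSerde")

    val builder = new StreamsBuilder()
    // Read records from a topic that Kafka Connect might populate from a database.
    val orders: KStream[String, String] = builder.stream("db-orders")

    // Placeholder enrichment; a real pipeline might join against reference data.
    orders.mapValues(new ValueMapper[String, String] {
      override def apply(v: String): String = s"""{"enriched":true,"payload":$v}"""
    }).to("enriched-orders")

    new KafkaStreams(builder.build(), props).start()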

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Before joining Cloudera, Jennifer worked as a product line manager at VMware, working on the vSphere and Photon system management platforms.

Presentations

A deep dive into leveraging cloud infrastructure for data engineering workloads Session

Cloud infrastructure, with a scalable data store and elastic compute, is particularly well suited for large-scale data engineering workloads. Andrei Savu and Jennifer Wu explore the latest cloud technologies and outline cost, security, and ease-of-use considerations for data engineers.

Deploying and operating big data analytic apps on the public cloud Tutorial

Jennifer Wu, Eugene Fratkin, Andrei Savu, and Tony Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Tony Wu is a team lead of the Partner Enablement Cloud Hardware Infrastructure and Platform (CHIP) team at Cloudera, which is responsible for Microsoft Azure integration for Cloudera Director. Tony focuses on integrating partner solutions (cloud and hardware) with Cloudera software. He is also part of the team responsible for the EMC DSSD integration with Cloudera’s Distribution of Hadoop (CDH) and Cloudera Manager (CM).

Presentations

Deploying and operating big data analytic apps on the public cloud Tutorial

Jennifer Wu, Eugene Fratkin, Andrei Savu, and Tony Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Yinglian Xie is the CEO and cofounder of DataVisor, a startup in the area of big data analytics for security. Yinglian has been working in the area of internet security and privacy for over 10 years and has helped improve the security of billions of online users. Her work combines parallel-computing techniques, algorithms for mining large datasets, and security-domain knowledge into new solutions that prevent and combat a wide variety of attacks targeting consumer-facing online services. Prior to DataVisor, Yinglian was a senior researcher at Microsoft Research Silicon Valley, where she shipped a series of new techniques in production. She has been widely published in top conferences and served on the committees of many of them. Yinglian holds a PhD in computer science from Carnegie Mellon University.

Presentations

Don’t sleep on sleeper cells: Using big data to drive detection Session

How many of your users are really fraudsters waiting to strike? These sleeper cells exist in all online communities. Using data from more than 400 million users and 500 billion events from online services across the world, Yinglian Xie explores sleeper cells, explains the sophisticated attack techniques used to evade detection, and shows how Spark's in-memory big data security analytics can help.

Reynold Xin is a cofounder and chief architect at Databricks as well as an Apache Spark PMC member and release manager for Spark’s 2.0 release. Prior to Databricks, Reynold was pursuing a PhD at the UC Berkeley AMPLab, where he worked on large-scale data processing.

Presentations

A behind-the-scenes look into Spark's API and engine evolutions Session

Reynold Xin looks back at the history of data systems, from filesystems, databases, and big data systems (e.g., MapReduce) to "small data" systems (e.g., R and Python), covering the pros and cons of each, the abstractions they provide, and the engines underneath. Reynold then shares lessons learned from this evolution, explains how Spark is developed, and offers a peek into the future of Spark.

Tony Xing is a senior product manager on the Shared Data team within Microsoft’s Application and Service group. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service group. Tony recently gave a talk at Strata + Hadoop World in Beijing.

Presentations

The common anomaly detection platform at Microsoft Session

Tony Xing offers an overview of Microsoft's common anomaly detection platform, an API service built internally to give product teams the flexibility to plug in any anomaly detection algorithm that fits their signal types.
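
The platform's internal API is not public, but the plug-in pattern the abstract describes can be sketched: a shared interface behind which each team registers whatever detector suits its signal type. Everything below, names included, is a hypothetical illustration.

```python
# Hypothetical sketch of a pluggable anomaly detection service: product
# teams register any detector behind one shared interface.
from typing import Callable, Dict, List

# A detector maps a numeric signal to per-point anomaly flags.
Detector = Callable[[List[float]], List[bool]]

_registry: Dict[str, Detector] = {}

def register(signal_type: str, detector: Detector) -> None:
    _registry[signal_type] = detector

def detect(signal_type: str, series: List[float]) -> List[bool]:
    return _registry[signal_type](series)

def three_sigma(series: List[float]) -> List[bool]:
    """Baseline detector: flag points more than 3 stddevs from the mean."""
    mean = sum(series) / len(series)
    std = (sum((x - mean) ** 2 for x in series) / len(series)) ** 0.5
    if std == 0:
        return [False] * len(series)
    return [abs(x - mean) > 3 * std for x in series]

register("latency", three_sigma)
print(detect("latency", [10] * 10 + [95]))  # only the final spike is flagged
```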

Chandhu Yalla is a senior engineering manager and head of the big data/analytics competency at Intel IT, where he oversees Intel’s big data service engineering and development and is responsible for spearheading platform design, architecture, implementation, capability guidance, and application development. With 19 years’ experience in IT, Chandhu is regarded as a thought leader and expert in business intelligence architecture. He is a recipient of the Manager Excellence award and an inductee into the IT Engineering Hall of Fame. Chandhu has presented at numerous events and coauthored several IT@Intel and IIA white papers. Chandhu holds an MS in information technology from QUT, Australia.

Presentations

When big data leads to big results (sponsored by Paxata) Session

Thousands of companies have made their initial investments in next-generation data lake architecture and are on the verge of generating quality business returns. Chandhu Yalla and Neshad Bardoliwalla explain how enterprises have unlocked tangible value from their data lakes with adaptive information management and how their organizations are providing self-service to business units.

David Yan is an Apache Apex PMC member and an architect at DataTorrent. Previously, David worked in the Ad Systems, Yahoo Finance, and del.icio.us groups at Yahoo and in the Artificial Intelligence group at the Jet Propulsion Laboratory. David holds an MS in computer science from Stanford University and a BS in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Developing streaming applications with Apache Apex Session

David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics. With Apex, you can build applications that process data scalably and reliably, with high throughput and low latency.

An expert in quantitative modeling with a strong background in finance, Jeffrey Yau has over 17 years of experience applying econometric, statistical, and mathematical modeling techniques to real-world challenges. As vice president of data science at Silicon Valley Data Science, Jeffrey has a passion for leading data science teams in finding innovative solutions to challenging business problems.

Presentations

Graph-based anomaly detection: When and how Session

Thanks to frameworks such as Spark's GraphX and GraphFrames, graph-based techniques are increasingly applicable to anomaly, outlier, and event detection in time series. Jeffrey Yau offers an overview of applying graph-based techniques to fraud detection, IoT processing, and financial data analysis and outlines the benefits of graphs relative to other techniques.
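
As one concrete flavor of the approach, here is a hedged sketch of degree-based outlier detection with GraphFrames on PySpark: flag vertices whose degree is far above the graph's norm, as a fraud hub often is. The toy graph, column names, and two-sigma threshold are illustrative, and the graphframes package must be available to Spark.

```python
# Flag accounts with outlying degree in a toy transaction graph.
# Requires Spark with the graphframes package on the classpath.
from pyspark.sql import SparkSession, functions as F
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-anomaly").getOrCreate()

# Vertices need an `id` column; edges need `src` and `dst`.
vertices = spark.createDataFrame([(v,) for v in "abcdefgh"], ["id"])
edges = spark.createDataFrame([("a", v) for v in "bcdefgh"], ["src", "dst"])
g = GraphFrame(vertices, edges)

# Compute the degree distribution, then flag vertices > mean + 2 * stddev.
deg = g.degrees
stats = deg.agg(F.mean("degree").alias("mu"), F.stddev("degree").alias("sigma"))
outliers = deg.crossJoin(stats).where(
    F.col("degree") > F.col("mu") + 2 * F.col("sigma")
)
outliers.select("id", "degree").show()  # the hub vertex "a" is flagged
```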

Ting-Fang Yen is a research scientist at DataVisor, a startup providing big data security analytics for consumer-facing web and mobile sites. Ting-Fang holds a PhD in electrical and computer engineering from Carnegie Mellon University.

Presentations

Cloudy with a chance of fraud: A look at cloud-hosted attack trends Session

When it comes to visibility into account takeover, spam, and fake accounts, the cloud is making things hazy. Cloud-hosted attacks skirt IP blacklists and make fraudulent users seem like they are located somewhere they are not. Drawing on data from 500 billion events and 400 million user accounts, Ting-Fang Yen examines cloud-based attack trends across verticals and regions.
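
One building block such an analysis implies is classifying traffic by whether it originates from cloud provider address space rather than by its claimed geography. A minimal sketch follows, using only the Python standard library; the CIDR blocks are RFC 5737 documentation ranges standing in for real provider lists, which AWS, Azure, and others publish.

```python
# Check whether a client IP falls inside (illustrative) cloud provider
# ranges. Real CIDR lists are published by the major cloud providers.
import ipaddress

CLOUD_RANGES = [
    ipaddress.ip_network(c) for c in ("203.0.113.0/24", "198.51.100.0/24")
]

def is_cloud_hosted(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CLOUD_RANGES)

print(is_cloud_hosted("203.0.113.57"))  # True: inside a listed range
print(is_cloud_hosted("192.0.2.1"))     # False
```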

Mike Yoder is a software engineer at Cloudera who has worked on a variety of Hadoop security features and internal security initiatives. Most recently, he implemented log redaction and the encryption of sensitive configuration values in Cloudera Manager. Prior to Cloudera, he was a security architect at Vormetric.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Mark Donsky, André Araujo, Michael Yoder, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.

Presentations

Making self-service data science a reality Session

Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.

Ethan Zhang is a software engineer at VoltDB. Ethan is working toward a PhD at the University of Houston, where his research focuses on parallel database systems and data cubes.

Presentations

Continuous queries over high-velocity event streams using an in-memory database (sponsored by VoltDB) Session

Continuous queries on streaming data play a vital role in fast data applications, providing always up-to-date results based on the most recent data. Ethan Zhang offers an overview of VoltDB, a NewSQL distributed database whose materialized views answer continuous queries up to three orders of magnitude faster than repeated query reexecution, highlighting a transparent, automatic, and incremental approach to view maintenance.
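
In VoltDB the mechanism is declared in SQL, but the reason incremental maintenance wins can be shown with a toy model: update per-group aggregates on each insert instead of rescanning the base table on every query. The sketch below is an illustration of the idea, not VoltDB's implementation.

```python
# Toy model of incremental view maintenance: keep per-group COUNT and
# SUM current on every insert, instead of rescanning the base table.
from collections import defaultdict

base_table = []                     # the raw event stream
view = defaultdict(lambda: [0, 0])  # group -> [count, sum]

def insert(group: str, amount: float) -> None:
    base_table.append((group, amount))
    row = view[group]  # O(1) delta update; recomputing the aggregate
    row[0] += 1        # from scratch would cost O(len(base_table))
    row[1] += amount   # per query

insert("us-west", 10.0)
insert("us-west", 5.0)
insert("eu", 2.0)
print(dict(view))  # {'us-west': [2, 15.0], 'eu': [1, 2.0]}
```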

Hang Zhang is a senior data science manager on the Algorithm and Data Science team in the Data group at Microsoft, where his major focus is on team data science processes and the Cortana Intelligence Competition Platform. Previously, Hang was a staff data scientist at WalmartLabs in charge of internal business intelligence tools and a senior data scientist at Opera Solutions. He is a senior member of the IEEE. Hang holds a PhD in industrial and systems engineering and an MS in statistics from Rutgers University.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Xiatian Zhang is responsible for mobile big data mining and machine-learning algorithm research and implementation at TalkingData. Xiatian has worked in data mining and machine-learning research for many years, has published dozens of research papers, and holds several patents. Previously, he worked at IBM’s China Research Institute, on the Tencent data platform, and at Huawei’s Noah’s Ark Lab.

Presentations

Fregata: TalkingData's lightweight, large-scale machine-learning library on Spark (sponsored by TalkingData) Session

Large-scale machine learning is a big challenge in industry due to the huge computing resources required and the difficulty of parameter tuning. Xiatian Zhang offers an overview of Fregata, TalkingData's open source machine-learning library based on Spark, which provides a lightweight, fast, memory-efficient, and parameter-free solution for large-scale machine learning.

Mengyue Zhao is a data scientist at Microsoft, where she develops end-to-end machine-learning solutions for various use cases in cloud computing and distributed platforms (e.g., Azure, Hadoop, and Spark). Mengyue focuses on scalable analysis, including data processing, feature engineering, feature selection, predictive modeling, and web services development. Previously, she was a data analyst at GE Digital, mainly focusing on solving machine-learning problems in the manufacturing domain. Mengyue has broad interests in machine learning, deep learning, and data mining and is passionate about harnessing the power of big data to answer interesting questions and drive business decisions. Mengyue holds a master’s degree in analytics from the University of San Francisco.

Presentations

Using R for scalable data analytics: From single machines to Hadoop Spark clusters Tutorial

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Alice Zheng leads the machine learning optimization team on Amazon’s advertising platform. She specializes in research and development of machine learning methods, tools, and applications. Outside of work, she is writing a book, Mastering Feature Engineering. Previously, Alice worked at GraphLab/Dato/Turi, where she led the machine learning toolkits team and spearheaded user outreach. Prior to joining GraphLab, she was a researcher in the Machine Learning group at Microsoft Research, Redmond. Alice holds PhD and BA degrees in computer science and a BA in mathematics, all from UC Berkeley.

Presentations

Feature engineering for diverse data types Session

In the machine-learning pipeline, feature engineering consumes the majority of the time yet is seldom discussed. Alice Zheng leads a tour of popular feature engineering methods for text, logs, and images, giving you an intuitive and actionable understanding of the tricks of the trade.
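
For a taste of the text portion of that toolbox, here is a minimal scikit-learn sketch of tf-idf weighting, one of the staple techniques such a tour typically covers; the corpus is toy data.

```python
# tf-idf weighting, a staple text feature-engineering technique.
# The corpus is toy data; requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "error: disk full on node 3",
    "user login succeeded",
    "error: network timeout on node 7",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse matrix: one row per document
print(X.shape)                       # (3, number_of_terms)
print(sorted(vec.vocabulary_)[:5])   # a few of the learned vocabulary terms
```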

As vice president of products at Trifacta, Wei Zheng combines her passion for technology with experience in enterprise software to define and shape Trifacta’s product offerings. Having founded several startups of her own, Wei believes strongly in innovative technology that solves real-world business problems. Previously, she led product management efforts at Informatica, where she helped launch several new solutions, including its Hadoop and data-virtualization products.

Presentations

Why the next wave of data lineage is driven by automation, visualization, and interaction Session

Sean Kandel and Wei Zheng offer an overview of an entirely new approach to visualizing metadata and data lineage, demonstrating automated methods for detecting, visualizing, and interacting with potential anomalies in reporting pipelines. Join in to learn what’s required to efficiently apply these techniques to large-scale data.

Chao Zhong is a senior data scientist at C+E Analytics and Insights within Microsoft. His current research interests include (deep) machine learning for customer journey and customer lifetime value and (deep) reinforcement learning for interactive customer behavior modeling. Previously, Chao was the lead data scientist at Scopely, a mobile gaming company in LA. Chao was an ABD (all but dissertation) PhD candidate in mathematics at Michigan Technological University. He holds an MS degree in financial engineering from Temple University and a BS degree in computer science from Beijing University of Aeronautics and Astronautics.

Presentations

Predicting customer lifetime value for a subscription-based business Session

Chao Zhong offers an overview of a new predictive model for customer lifetime value (LTV) in a cloud-computing business. The model is also the first known application of the Fader RFM approach to a cloud business; this Bayesian approach predicts a customer's LTV with a symmetric absolute percentage error (SAPE) of only 3% on an out-of-time test dataset.
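
The talk's exact model is not public, but the Fader-style RFM approach it builds on is implemented in the open source lifetimes package: BG/NBD for purchase frequency plus Gamma-Gamma for spend. A minimal sketch on invented toy data follows; a real fit needs many customers.

```python
# Fader-style probabilistic LTV: BG/NBD models repeat-purchase behavior
# and Gamma-Gamma models spend per transaction. Toy data; requires the
# lifetimes and pandas packages.
import pandas as pd
from lifetimes import BetaGeoFitter, GammaGammaFitter

rfm = pd.DataFrame({
    "frequency": [3, 1, 5, 2, 4, 1],                 # repeat purchases
    "recency": [30.0, 10.0, 50.0, 20.0, 45.0, 5.0],  # age at last purchase (days)
    "T": [60.0, 40.0, 55.0, 50.0, 60.0, 30.0],       # customer age (days)
    "monetary_value": [25.0, 10.0, 40.0, 15.0, 30.0, 8.0],
})

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(rfm["frequency"], rfm["recency"], rfm["T"])

ggf = GammaGammaFitter(penalizer_coef=0.001)
ggf.fit(rfm["frequency"], rfm["monetary_value"])

# Expected discounted LTV per customer over the next 12 months.
ltv = ggf.customer_lifetime_value(
    bgf, rfm["frequency"], rfm["recency"], rfm["T"], rfm["monetary_value"],
    time=12, discount_rate=0.01,
)
print(ltv.head())
```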

Feng Zhu is a data scientist at C+E Analytics and Insights within Microsoft, where he focuses on building end-to-end solutions for various problems in the Microsoft Cloud business using advanced machine-learning techniques. Previously, Feng was a research scientist on the Fraud Detection and Risk Management team at Amazon, where he collaborated with various business and engineering teams to provide fraud detection and mitigation solutions for the Pay with Amazon product. He holds a PhD in electrical engineering and MS degrees in electrical engineering and applied mathematics from the University of Notre Dame and a BS from Harbin Institute of Technology, China.

Presentations

How Microsoft predicts churn of cloud customers using deep learning and explains those predictions in an interpretable way Session

Although deep learning has proved to be very powerful, few results have been reported on its application to business-focused problems. Feng Zhu and Val Fontama explore how Microsoft built a deep learning-based churn prediction model and demonstrate how to explain its predictions using LIME, a novel algorithm published at KDD 2016, to make black-box models more transparent and accessible.
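
Microsoft's churn model is internal, but LIME itself is open source. The hedged sketch below explains a single prediction of a stand-in classifier; a random forest substitutes for the deep network, and the features, data, and class names are invented for illustration.

```python
# Explain one churn prediction with LIME. A random forest stands in for
# the deep model; any classifier exposing predict_proba works.
# Requires the lime and scikit-learn packages.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.RandomState(0)
X = rng.rand(500, 4)                       # toy usage features
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)  # toy churn label

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["logins", "tickets", "spend_drop", "tenure"],
    class_names=["retained", "churned"],
    mode="classification",
)
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(exp.as_list())  # per-feature contributions for this one prediction
```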

John Zhuge is a software engineer at Cloudera focusing on Hadoop HDFS and Hadoop-compatible filesystems for the cloud. An active Apache Hadoop committer, he contributes to open source distributed systems across the Apache Hadoop ecosystem. Previously, John designed and implemented filesystems and protocols for storage systems. He holds seven US patents.

Presentations

Running a Cloudera cluster in production on Azure Session

Paige Liu and John Zhuge explore the options and trade-offs to consider when building a Cloudera cluster on Microsoft Azure, explaining how to deploy and scale the cluster and how to connect it with other Azure services to build enterprise-grade, end-to-end big data solutions.