Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.

Justin Bleich is a senior data scientist at Coatue Management. Previously, Justin was the cofounder and CTO of Zodiac, an artificial intelligence startup focused on predicting customer behavior to help brands retain their best customers and find more like them, as well as an adjunct professor at the Wharton School at the University of Pennsylvania, where he taught advanced data mining and predictive modeling. Justin holds a PhD in statistics from the Wharton School, where he focused on Bayesian machine learning and ensemble-of-trees algorithms.

Presentations

Probabilistic programming in finance using Prophet Session

Prophet is a Bayesian nonlinear time series forecasting model recently released by Facebook. Justin Bleich explains how Coatue—a hedge fund that uses data science to drive investment decisions—extends Prophet to include exogenous covariates when generating forecasts and applies it to nowcasting macroeconomic series using higher-frequency data available from sources such as Google Trends.
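
The abstract doesn’t spell out Coatue’s extension, but Prophet’s public Python API supports extra regressors directly. Below is a minimal sketch of forecasting with an exogenous covariate, assuming a weekly series and a hypothetical "trends" column standing in for a Google Trends index; package, column names, and numbers are illustrative, not Coatue’s implementation.

```python
import pandas as pd
from fbprophet import Prophet  # pip install fbprophet (the package name at the time)

# Prophet expects columns 'ds' (date) and 'y' (target); 'trends' is a
# hypothetical exogenous covariate, e.g. a Google Trends index.
df = pd.DataFrame({
    "ds": pd.date_range("2015-01-01", periods=104, freq="W"),
    "y": range(104),
    "trends": range(104),
})

m = Prophet()
m.add_regressor("trends")  # include the exogenous covariate in the model
m.fit(df)

# Future values of the regressor must be supplied for the forecast horizon.
future = m.make_future_dataframe(periods=8, freq="W")
future["trends"] = list(df["trends"]) + [104] * 8
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```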

Ashvin Agrawal is a senior research engineer at Microsoft, where he works on streaming systems and contributes to the Twitter Heron project. Ashvin is a software engineer with more than 10 years of experience. He specializes in developing large-scale distributed systems. Previously, he worked at VMware, Yahoo, and Mojo Networks. Ashvin holds an MTech in computer science from IIT Kanpur, India.

Presentations

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and insights on how to address them.

Manish Ahluwalia is a software engineer at Cloudera, where he focuses on security of the Hadoop ecosystem. Manish has been working in big data since its infancy in various companies in Silicon Valley. He is most passionate about security.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads technical infrastructure’s internal data processing teams for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Foundations of streaming SQL; or, How I learned to love stream and table theory Session

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.
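
As a concrete anchor for the stream/table relationship, here is a toy pipeline in the Beam Python SDK, one of the programmatic frameworks the session relates to SQL. The windowed, per-key aggregation is exactly the step that turns a stream of events into an evolving table. The data and window size are illustrative.

```python
import apache_beam as beam
from apache_beam.transforms import window

# A grouped aggregation turns a stream of events into an evolving table:
# here, per-key sums within one-minute event-time windows.
with beam.Pipeline() as p:
    (p
     | beam.Create([("a", 1), ("b", 1), ("a", 1)])  # stand-in for an unbounded source
     | beam.WindowInto(window.FixedWindows(60))     # 60-second event-time windows
     | beam.CombinePerKey(sum)                      # stream -> table (per window, per key)
     | beam.Map(print))
```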

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Managing data science in the enterprise Tutorial

John Akred and Heather Nelson share methods and observations from three years of effectively deploying data science in enterprise organizations. You'll learn how to build, run, and get the most value from data science teams and how to work with and plan for the needs of the business.

Brendan Aldrich is leading data modernization and democratization initiatives as the chief data officer for Ivy Tech Community College, the largest singly accredited community college in the nation, educating nearly 175,000 students each year across the state of Indiana.

A cross-industry data innovations specialist, Aldrich has over 20 years of information technology experience at companies like the Walt Disney Company, Demand Media, Travelers Insurance, and the City Colleges of Chicago, where he has repeatedly built and led top-performing teams that have transformed the enterprise.

He holds a bachelor’s degree from California State University, Los Angeles, and his groundbreaking work at City Colleges of Chicago and Ivy Tech has been recognized with both a 2014 Innovators Award from Campus Technology magazine and Gartner’s 2017 Data and Analytics Excellence Award.

Presentations

Learning from higher education: How Ivy Tech is using predictive analytics and a data democracy to reverse decades of entrenched practices Session

As the largest community college in the US, Ivy Tech ingests over 100M rows of data a day. Brendan Aldrich and Lige Hensley explain how Ivy Tech is applying predictive technologies to establish a true data democracy—a self-service data analytics environment empowering thousands of users each day to improve operations, achieve strategic goals, and support student success.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.
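
One common shape for the ingest-and-process problem pairs a Kafka topic with Spark’s Structured Streaming API. The sketch below is one approach among those the training compares, with hypothetical broker and topic names; it assumes the spark-sql-kafka connector package is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Read an unbounded stream from a Kafka topic (names here are hypothetical).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load())

# Kafka records arrive as binary key/value; cast, then aggregate per minute.
counts = (events
          .selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Write running counts to the console for demonstration purposes.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```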

Real-time Systems with Spark Streaming and Kafka (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

The five dysfunctions of a data engineering team Session

Early project success is predicated on management making sure a data engineering team is ready and has all of the skills needed. Jesse Anderson outlines five of the most common nontechnology reasons why data engineering teams fail.

Assaf Araki is the senior architect for big data analytics at Intel, where his group is responsible for big data analytics pathfinding within the company. Assaf drives Intel’s work with academia and industry on big data analytics and brings new technologies into Intel Information Technology. He has over 10 years of experience in data warehousing, decision support solutions, and applied analytics within Intel.

Presentations

Hardcore Data Science welcome HDS

Hosts Ben Lorica and Assaf Araki welcome you to Hardcore Data Science day.

André Araujo is a solutions architect with Cloudera. Previously, he was an Oracle database administrator. An experienced consultant with a deep understanding of the Hadoop stack and its components, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs. André is a methodical and keen troubleshooter who loves making things run faster.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Tasso Argyros is the founder and CEO of ActionIQ, an enterprise software company that aims to bridge the gap between marketing and data for Global 2000 leaders. ActionIQ already counts some of the largest enterprises in the world, across the telco, retail, and other verticals, as its clients.

Prior to ActionIQ, Tasso cofounded Aster Data, an early big data pioneer, in 2005 after dropping out of the PhD program at Stanford. After five years of strong growth, Aster Data was sold to Teradata in 2011, where Tasso became the copresident and GM of Teradata’s big data division. He is also a venture partner at FirstMark Capital and a cofounder of DataElite Ventures, a San Francisco-based seed-stage fund focused on big data companies.

Tasso has received several awards and recognitions, including BusinessWeek’s “Best Young Tech Entrepreneur” for 2009, the World Economic Forum’s “Technology Pioneer” in 2010, and Forbes’ “NextGen Innovator” in 2013. He holds a master’s degree in computer science from Stanford University and a diploma in computer engineering from the Technical University of Athens.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

Eduardo Arino de la Rubia is chief data scientist at Domino Data Lab. Eduardo is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. He is a graduate of the MTSU Computer Science department, General Assembly’s Data Science program, and the Johns Hopkins Coursera Data Science specialization. Eduardo is currently pursuing a master’s degree in negotiation, conflict resolution, and peacebuilding from CSUDH. You can follow him on Twitter as @earino.

Presentations

Leveraging open source automated data science tools Session

The promise of the automated statistician is as old as statistics itself. Eduardo Arino de la Rubia explores the tools created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation. Along the way, Eduardo compares open source tools such as TPOT and auto-sklearn and discusses their place in the DS workflow.
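
As a flavor of what these tools automate, here is a minimal TPOT run on a toy dataset: TPOT uses genetic programming to search over scikit-learn preprocessing and modeling pipelines and can export the winner as plain Python. Parameters are illustrative, not tuned.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier  # pip install tpot

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT evolves candidate pipelines over several generations.
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # emits the winning pipeline as sklearn code
```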

Carme Artigas is the founder and CEO of Synergic Partners, a strategic and technological consulting firm specializing in big data and data science (acquired by Telefónica in November 2015). She has more than 20 years of extensive expertise in the telecommunications and IT fields and has held several executive roles in both private companies and governmental institutions. Carme is a member of the Innovation Board of CEOE and the Industry Affiliate Partners at Columbia University’s Data Science Institute. An in-demand speaker on big data, she has given talks at several international forums, including Strata + Hadoop World, and collaborates as a professor in various master’s programs on new technologies, big data, and innovation. Carme was recently recognized as the only Spanish woman among the 30 most influential women in business by Insight Success. She holds an MS in chemical engineering and an MBA from Ramon Llull University in Barcelona and an executive degree in venture capital from UC Berkeley’s Haas School of Business.

Presentations

Executive Briefing: Analytics centers of excellence as a way to accelerate big data adoption by business Session

Big data technology is mature, but its adoption by business is slow, due in part to challenges like a lack of resources or the need for a cultural change. Carme Artigas explains why an analytics center of excellence (ACoE), whether internal or outsourced, is an effective way to accelerate the adoption and shares an approach to implementing an ACoE.

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

Using ML to solve failure problems with ML and AI apps in Spark Session

Spark promises agility, but application developers can get stuck on application failures and have a tough time finding and resolving the issue. Adrian Popescu and Shivnath Babu explain how to use a root cause diagnosis algorithm and methodology to solve failure problems with ML and AI apps in Spark.

Josh Baer has been a data infrastructure product lead at Spotify since 2013. He’s worked on growing Spotify’s Hadoop footprint from 180 machines to 2,000, enabling everyday real-time processing and providing infrastructure for advanced machine learning tasks.

Right now, Josh is leading the data processing track of Spotify’s migration to Google Cloud Platform.

Presentations

Spotify in the cloud: The next evolution of data at Spotify Session

In early 2016, Spotify decided that it didn’t want to be in the data center business. The future was the cloud. Josh Baer and Alison Gilles share Spotify's story and explain what it takes to move to the cloud, covering Spotify's technology choices, challenges faced, and the lessons Spotify learned along the way.

Travis Bakeman is a senior manager of systems design and strategy at T-Mobile, where he focuses on network performance management and big data analytics and is responsible for multiple teams that deliver enterprise solutions leveraging off-the-shelf options such as Splunk and Oracle RAC and open source technologies like Cloudera Hadoop. During his tenure with T-Mobile, he has worked in operational support, database administration, data mediation, report development, data enrichment, and frontend application design. Previously, Travis worked in military intelligence in the United States Army. He started his career in the telecom industry in data center operations.

Presentations

How T-Mobile built a massive-scale network performance management platform on Hadoop Session

Travis Bakeman shares how T-Mobile ported its large-scale network performance management platform, T-PIM, from a legacy database to a big data platform with Impala as the main reporting interface, covering the migration journey, including the challenges the team faced, how the team evaluated new technologies, lessons learned along the way, and the efficiencies gained as a result.

Michael Balint is a senior manager of applied solutions engineering at NVIDIA. Previously, Michael was a White House Presidential Innovation Fellow, where he brought his technical expertise to projects like Vice President Biden’s Cancer Moonshot program and Code.gov. Michael has had the good fortune of applying software engineering and data science to many interesting problems throughout his career, including tailoring genetic algorithms to optimize air traffic, harnessing NLP to summarize product reviews, and automating the detection of melanoma via machine learning. He is a graduate of Cornell and Johns Hopkins University.

Presentations

Training a deep learning risk detection platform Session

Joshua Patterson and Michael Balint explain how to bootstrap a deep learning framework to detect risk and threats in production operational systems using best-of-breed GPU-accelerated open source tools.

Kirit Basu is director of product management at StreamSets.

Presentations

Real-time image classification: Using convolutional neural networks on real-time streaming data Session

Enterprises building data lakes often have to deal with very large volumes of image data that they have collected over the years. Josh Patterson and Kirit Basu explain how some of the most sophisticated big data deployments are using convolutional neural nets to automatically classify images and add rich context about the content of each image, in real time, while ingesting data at scale.
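
The session’s StreamSets-specific pipeline isn’t shown here, but the per-record classification step such a pipeline embeds can be sketched with a pretrained Keras CNN. The image path is hypothetical, and the ImageNet weights download on first use.

```python
import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from keras.preprocessing import image

# Load a CNN pretrained on ImageNet.
model = ResNet50(weights="imagenet")

def classify(path):
    # Resize to the network's expected input and run a forward pass.
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = model.predict(x)
    return decode_predictions(preds, top=3)[0]  # [(class_id, label, score), ...]

print(classify("example.jpg"))  # hypothetical image file
```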

Dominikus Baur works to make data accessible in every situation. As a data visualization and mobile interaction designer and developer, Dominikus creates usable, aesthetic, and responsive visualizations for desktops, tablets, and smartphones. As a freelancer, he has helped create beautiful visualizations for clients including the OECD, Microsoft Research, and Wincor Nixdorf. As a trainer for data visualization development, he holds workshops providing both a scientific and practical background. Dominikus is a regular speaker at academic and industry conferences. He holds a PhD in media informatics from the University of Munich (Ludwig-Maximilians-Universität), where his research focused on making our growing personal databases of media, status updates, and messages manageable.

Presentations

Data futures: Exploring the everyday implications of increasing access to our personal data Session

Increasing access to our personal data raises profound moral and ethical questions. Daniel Goddemeyer and Dominikus Baur share the findings from Data Futures, an MFA class in which students observed each other through their own data, and demonstrate the results with a live experiment with the audience that showcases some of the effects when personal data becomes accessible.

Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services, where he focuses on AI and real-time streaming technologies, working with AWS customers to build data-driven products (whether batch or real time) and create solutions powered by ML in the cloud. Roy has worked in the data and analytics industry for over a decade and has helped hundreds of customers bring compelling data-driven products to the market. He serves on the advisory board of Applied Mathematics and Data Science at Post University in Connecticut. Roy holds a BSc in information systems and an MBA from the University of Georgia.

Presentations

Creating a serverless real-time analytics platform powered by machine learning in the cloud Session

Speed matters. Today, decisions are made based on real-time insights, but in order to support the substantial growth of streaming data, companies are required to innovate. Roy Ben-Alta and Allan MacInnis explore AWS solutions powered by machine learning and artificial intelligence.
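
A serverless pipeline of this kind typically starts with records written to a managed stream such as Amazon Kinesis, which downstream functions can then score with a model. Below is a minimal ingestion sketch with boto3; the stream name, region, and event shape are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Each record lands on a shard chosen by the partition key; downstream,
# a Lambda function or analytics application can score it in real time.
event = {"user_id": "u-123", "action": "click", "ts": "2017-09-26T12:00:00Z"}
kinesis.put_record(
    StreamName="clickstream",          # hypothetical stream name
    Data=json.dumps(event).encode(),
    PartitionKey=event["user_id"],
)
```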

Tim Berglund is a teacher, author, and technology leader with DataStax. He has spoken at numerous conferences internationally and in the United States and contributes to the Denver tech community as president of the Denver Open Source User Group. He is the copresenter of various O’Reilly training videos on topics ranging from Git to Mac OS X productivity tips to Apache Cassandra and is the author of Gradle Beyond the Basics. Tim blogs very occasionally at Timberglund.com. He lives in Littleton, Colorado, with the wife of his youth and their three children.

Presentations

Heraclitus, enterprise architecture, and streaming data Session

As the Greek philosopher Heraclitus famously noted, you never step into the same river twice. Almost as famous as Heraclitus is Apache Kafka, the de facto standard open source distributed stream processing system. Tim Berglund shares several real-world systems that use Kafka not just as a giant message queue but as a platform for distributed stream computation.
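
Kafka’s stream-computation story (Kafka Streams proper is a Java library) boils down to a consume-transform-produce loop. Here is a minimal Python illustration of that pattern using the kafka-python client, with hypothetical topic names and a stand-in transformation; it is a sketch of the idea, not the session’s systems.

```python
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer("orders", bootstrap_servers="broker1:9092",
                         group_id="enricher")
producer = KafkaProducer(bootstrap_servers="broker1:9092")

# The consume-transform-produce loop: the essence of treating Kafka as a
# platform for distributed stream computation rather than a plain queue.
for msg in consumer:
    enriched = msg.value.upper()            # stand-in for real computation
    producer.send("orders-enriched", enriched)
```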

Ron Bodkin is the CTO of architecture and services at Teradata, where he is responsible for leading the global emerging technology team focusing on artificial intelligence, GPUs, and the blockchain; leading global consulting teams for enterprise analytics architectures combining Hadoop and Spark, the public cloud, and traditional data warehousing; and driving Teradata’s strategic pillar. Previously, Ron was the founding CEO of Think Big Analytics (acquired by Teradata in 2014), which provides end-to-end support for enterprise big data, including data science, data engineering, advisory and managed services, and frameworks such as Kylo for enterprise data lakes; VP of engineering at Quantcast, where he led the data science and engineering teams that pioneered the use of Hadoop and NoSQL for batch and real-time decision making; founder of New Aspects, which provided enterprise consulting for aspect-oriented programming; and cofounder and CTO of B2B applications provider C-Bridge, where he led a team of 900 people to a successful IPO. Ron holds a BS in math and computer science with honors from McGill University and a master’s degree in computer science from MIT, where he was also pursuing a PhD. He left the program after his idea for C-Bridge placed in the finals of MIT’s $50K Entrepreneurship Contest.

Presentations

Deep learning for recommender systems Tutorial

Ron Bodkin and Mo Patel demonstrate how to apply deep learning to improve consumer recommendations by training neural nets to learn categories of interest for recommendations using embeddings. You'll also learn how to achieve wide and deep learning with WALS matrix factorization—now used in production for the Google Play store.
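
This isn’t the presenters’ WALS implementation, but the core idea of learning categories of interest via embeddings can be sketched in a few lines of Keras: map users and items to dense vectors and train their dot product to match observed affinity. Sizes and hyperparameters are hypothetical.

```python
from keras.layers import Input, Embedding, Flatten, Dot
from keras.models import Model

n_users, n_items, dim = 10000, 5000, 32  # hypothetical sizes

# Each user and item is mapped to a dense vector; the dot product of the
# two vectors approximates affinity, and the learned vectors are embeddings.
user_in = Input(shape=(1,))
item_in = Input(shape=(1,))
user_vec = Flatten()(Embedding(n_users, dim)(user_in))
item_vec = Flatten()(Embedding(n_items, dim)(item_in))
score = Dot(axes=1)([user_vec, item_vec])

model = Model([user_in, item_in], score)
model.compile(optimizer="adam", loss="mse")
# model.fit([user_ids, item_ids], ratings, epochs=..., batch_size=...)
```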

Fighting financial fraud at Danske Bank with artificial intelligence Session

Fraud in banking is an arms race with criminals using machine learning to improve their attack effectiveness. Ron Bodkin and Nadeem Gulzar explore how Danske Bank uses deep learning for better fraud detection, covering model effectiveness, TensorFlow versus boosted decision trees, operational considerations in training and deploying models, and lessons learned along the way.

Charles Boicey is the chief innovation officer for Clearsense, a healthcare analytics organization specializing in bringing big data technologies to healthcare. Previously, Charles was the enterprise analytics architect for Stony Brook Medicine, where he developed the analytics infrastructure to serve the clinical, operational, quality, and research needs of the organization. He was a founding member of the team that developed the Health and Human Services award-winning application NowTrending to assist in the early detection of disease outbreaks by utilizing social media feeds. Charles is a former president of the American Nursing Informatics Association.

Presentations

Spark clinical surveillance: Saving lives and improving patient care Session

Charles Boicey explains how Clearsense uses Spark Streaming to provide real-time updates to healthcare providers for critical healthcare needs, helping clinicians make timely decisions from the assessment of a patient’s risk based on streaming physiological monitoring, streaming diagnostic data, and the patient’s historical record.

Matt Bolte is a technical expert at Walmart. Matt has 19 years of IT experience, five of them working with large, secure enterprise Hadoop clusters.

Presentations

An authenticated journey through big data security at Walmart Session

In today’s world of data breaches and hackers, security is one of the most important components for big data systems, but unfortunately, it's usually the area least planned and architected. Matt Bolte and Toni LeTempt share Walmart's authentication journey, focusing on how decisions made early can have significant impact throughout the maturation of your big data environment.

Tobi Bosede is a machine learning engineer. She has taught R programming at Johns Hopkins University and Python programming for General Assembly. Tobi’s professional work spans multiple industries, from telecom at Sprint to finance at JPMorgan. She holds a bachelor’s degree in mathematics from the University of Pennsylvania and a master’s in applied mathematics and statistics from Johns Hopkins University.

Presentations

Big data analysis of futures trades Session

Whether an entity seeks to create trading algorithms or mitigate risk, predicting trade volume is an important task. Focusing on futures trading that relies on Apache Spark to process the large amounts of data involved, Tobi Bosede considers the use of penalized regression splines for trade volume prediction and the relationship between price volatility and trade volume.
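
At production scale this runs on Spark, but the spline idea itself fits in a few lines: a smoothing spline penalizes wiggliness much as the penalized regression splines the session discusses. Here is a single-machine sketch on synthetic volatility/volume data; the data and smoothing level are entirely illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical series: traded volume as a noisy function of price volatility.
rng = np.random.RandomState(0)
volatility = np.sort(rng.uniform(0.01, 0.5, 200))
volume = 1e6 * volatility ** 0.7 + rng.normal(0, 2e4, 200)

# A smoothing (penalized) spline: `s` bounds the residual sum of squares,
# so larger values trade fidelity for a smoother fitted curve.
spline = UnivariateSpline(volatility, volume, s=len(volatility) * (2e4) ** 2)
predicted = spline(0.3)  # predicted volume at volatility = 0.3
print(predicted)
```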

danah boyd is the founder and president of Data & Society, a research institute focused on understanding the role of data-driven technologies in society, a principal researcher at Microsoft Research, and a visiting professor in NYU’s Interactive Telecommunications Program. danah’s research focuses on the intersection of technology, society, and policy. She is currently doing work on questions related to bias in big data and artificial intelligence, how people negotiate privacy and publicity, and the social ramifications of using data in education, criminal justice, labor, and public life. For over a decade, she examined how American youth incorporate social media into their daily practices in light of different fears and anxieties that the public has about young people’s engagement with technologies like MySpace, Facebook, Twitter, YouTube, Instagram, and texting. She has researched a plethora of teen issues, ranging from privacy to bullying, racial inequality, and sexual identity. Her early findings were published in Hanging Out, Messing Around, and Geeking Out: Kids Living and Learning with New Media. Her 2014 monograph, It’s Complicated: The Social Lives of Networked Teens, has received widespread praise from scholars, parents, and journalists and has been translated into seven languages. This work was funded by both the MacArthur Foundation and Microsoft Research. Her most recent collaborative book project, Participatory Culture in a Networked Era, with Mimi Ito and Henry Jenkins, reflects on how digital participation has shaped different parts of society. Her work has been profiled by numerous publications, including the New York Times, Fast Company, the Boston Globe, and Forbes, and published in a wide range of scholarly venues.

In 2010, danah won the CITASA Award for Public Sociology. The Financial Times dubbed her “the high priestess of internet friendship,” Fortune magazine identified her as the smartest academic in tech, and Technology Review named her one of 2010’s young innovators under 35. danah was a 2011 Young Global Leader of the World Economic Forum and is a member of the Council on Foreign Relations. She is a director of both Crisis Text Line and the Social Science Research Council and a trustee of the National Museum of the American Indian. She sits on advisory boards for the Electronic Privacy Information Center, Brown University’s Department of Computer Science, and the School of Information at the University of Michigan. She was a commissioner on the 2008–2009 Knight Commission on Information Needs of Communities in a Democracy. From 2009 to 2013, danah served on the World Economic Forum’s Social Media Global Agenda Council. At the Berkman Center, she codirected the Internet Safety Technical Task Force in 2008 with John Palfrey and Dena Sacco to work with companies and nonprofits to identify potential technical solutions for keeping children safe online. More recently, she codirected the Youth Media and Policy Working Group with John Palfrey and Urs Gasser, funded by the MacArthur Foundation from 2009 to 2011. In 2012, she and John Palfrey also helped the Born This Way Foundation and the MacArthur Foundation develop a research strategy to help empower youth to address meanness and cruelty. She is one of the hosts of the annual Data & Civil Rights Conference. Since 2015, she has also served on the US Commerce Department’s Data Advisory Council. She also created and managed a large online community for V-Day, a nonprofit organization working to end violence against women and girls worldwide. She has advised numerous other companies, sits on corporate, education, conference, and nonprofit advisory boards, and regularly speaks at a wide variety of conferences and events. danah holds a bachelor’s degree in computer science from Brown University (under Andy van Dam), a master’s degree in sociable media from MIT Media Lab (under Judith Donath), and a PhD in information from the University of California, Berkeley (under Peter Lyman and Mimi Ito). She has worked as an ethnographer and social media researcher for various corporations, including Intel, Tribe.net, Google, and Yahoo. She blogs at zephoria.org/thoughts/ and tweets as @zephoria.

Presentations

Keynote with danah boyd Keynote

Keynote with danah boyd

David Boyle leads the work of the Insight team at BBC Worldwide, the commercial and global wing of the BBC, where he helps to transform the relationship that BBC Worldwide has with its audience by building premium, industry-leading insight capabilities into consumers, BBC brands, and the market to determine what connects with audiences emotionally and inspires them. David has spent the last seven years constructing global insight capabilities for the publishing and music industries, which were widely acknowledged as having helped them make quicker, smarter, and bolder decisions for their brands. Previously, he was SVP of consumer insight at HarperCollins Publishers, where he helped the company better understand consumer behavior and attitudes toward books, authors, book discovery, and purchase, and worked at EMI Music, where he delivered insight to all parts of the business in more than 25 countries and helped to shift the organization’s decision making at all levels, from artist signing to product and brand development plans for EMI’s biggest artists, including the Beatles and Pink Floyd.

Presentations

From the weeds to the stars: How and why to think about bigger problems Session

Too many brilliant analytical minds are wasted on interesting but ultimately less impactful problems. They are stuck in the weeds of the data or the challenges of the day-to-day. Too few ask what it means to reach for the stars—the big, shiny, business-changing issues. David Boyle explains why you must start asking bigger questions and making a bigger difference.

Katherine Boyle is an investor at General Catalyst, an early-stage venture capital firm with $3.7 billion under management. She focuses on investments in highly regulated industries, including government, defense, aerospace, and autonomous mobility. Before becoming an investor, she was a staff reporter at the Washington Post covering creative industries, consumer retail, government accountability, and weird subcultures, the latter preparing her most for a career in venture capital.

Katherine received an MBA from Stanford Graduate School of Business, where she was a research assistant to Dr. Condoleezza Rice for her course and upcoming book “Managing Political Risk.” She’s a graduate of Georgetown University and holds a master’s degree in public advocacy from the National University of Ireland, Galway.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo Inc., building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with spaCy, Spark ML, and TensorFlow Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, TensorFlow for training custom machine-learned annotators, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
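
As a taste of the spaCy portion, the annotation pipeline in its simplest form loads a pretrained model and reads off token attributes and named entities. The model name assumes spaCy’s small English model is installed; the sentence is illustrative.

```python
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("Danske Bank hired 50 analysts in Copenhagen last year.")

# Each Doc carries token-level annotations...
for token in doc[:5]:
    print(token.text, token.pos_, token.dep_)

# ...and document-level named entities.
for ent in doc.ents:
    print(ent.text, ent.label_)
```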

Richard Brath is a partner at Uncharted Software. Richard has been designing and building innovative information visualizations for 20 years, ranging from one of the first interactive 3D financial visualizations on the web in 1995 to visualizations embedded in financial data systems used every day by thousands of market professionals. Richard is pursuing a PhD in new data visualization techniques at LSBU.

Presentations

Text analytics and new visualization techniques Session

Text analytics are advancing rapidly, and new visualization techniques for text are providing new capabilities. Richard Brath offers an overview of these new ways to organize massive volumes of text, characterize subjects, score synopses, and skim through lots of documents.

Mikio Braun is delivery lead for recommendation and search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in the industry.

Presentations

Deep learning in practice Session

Deep learning has become the go-to solution for many application areas, such as image classification or speech processing, but does it work for all application areas? Mikio Braun offers background on deep learning and shares his practical experience working with these exciting technologies.

Tamara Broderick is the ITT Career Development Assistant Professor in the Department of Electrical Engineering and Computer Science at MIT. Tamara’s recent research is focused on developing and analyzing models for scalable Bayesian machine learning, especially Bayesian nonparametrics. She is a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), the MIT Statistics and Data Science Center, and the Institute for Data, Systems, and Society (IDSS). Tamara has been awarded a Google faculty research award, the ISBA Lifetime Members Junior Researcher Award, the Savage Award (for an outstanding doctoral dissertation in Bayesian theory and methods), the Evelyn Fix Memorial Medal and Citation (for the PhD student on the Berkeley campus showing the greatest promise in statistical research), the Berkeley fellowship, an NSF Graduate Research Fellowship, a Marshall Scholarship, and the Phi Beta Kappa Prize (for the graduating Princeton senior with the highest academic average). She holds a PhD in statistics from the University of California, Berkeley, completed under Michael I. Jordan, an AB in mathematics from Princeton University, a master of advanced study for completion of Part III of the Mathematical Tripos from the University of Cambridge, an MPhil by research in physics from the University of Cambridge, and an MS in computer science from the University of California, Berkeley.

Presentations

Session with Tamara Broderick HDS

Session with Tamara Broderick

Kalah Brown is a senior Hadoop engineer at Big Fish Games, where she is responsible for technical leadership and the development of big data solutions. Previously, Kalah was a consultant in the greater Seattle area and worked with numerous companies, including Disney, Starbucks, the Bill and Melinda Gates Foundation, Microsoft, and Premera Blue Cross. She has 17 years of experience in software development, data warehousing, and business intelligence.

Presentations

Working within the Hadoop ecosystem to build a live-streaming data pipeline Session

Companies are increasingly interested in processing and analyzing live-streaming data. The Hadoop ecosystem includes platforms and software library frameworks to support this work, but these components require correct architecture, performance tuning, and customization. Stephen Devine and Kalah Brown explain how they used Spark, Flume, and Kafka to build a live-streaming data pipeline.

Kurt Brown leads the Data Platform team at Netflix. Kurt’s group architects and manages the technical infrastructure underpinning the company’s analytics, which includes various big data technologies like Hadoop, Spark, and Presto, Netflix open-sourced applications and services such as Genie and Lipstick, and traditional BI tools including Tableau and Redshift.

Presentations

20 Netflix-style principles and practices to get the most out of your data platform Session

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they interact with the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.

Joanna J. Bryson is a transdisciplinary researcher on the structure and dynamics of human- and animal-like intelligence. Her research covers topics ranging from artificial intelligence through autonomy and robot ethics to human cooperation and has appeared in venues ranging from Reddit to Science. Joanna is a professor in the Department of Computer Science at the University of Bath, where she founded and for several years led the Intelligent Systems research group. Joanna is also affiliated with Bath’s Institutes for Policy Research and Mathematical Innovation, as well as their Centres for Networks and Collective Behaviour and for Digital Entertainment. She has held visiting academic positions with Princeton’s Center for Information Technology Policy (where she is still affiliated), the Mannheim Centre for Social Science Research, the Department of Anthropology at Oxford, where she worked on Harvey Whitehouse’s Explaining Religion project, the Methods & Data Institute at Nottingham, doing agent-based modeling in political science, and the Konrad Lorenz Institute for Evolution & Cognition Research in Austria, where she researched the biological origins of culture. She has conducted academic research in Edinburgh’s Human Communication Research Centre and Harvard’s Department of Psychology. Outside of academia, Joanna has worked in Chicago’s financial industry, international organization management consultancy, and industrial AI research. Joanna has served on the senate, council, and court for the University of Bath, representing the academic assembly. She is currently a member of the College of the British Engineering and Physical Sciences Research Council (EPSRC) and serves as a member of the editorial board for several academic journals, including Adaptive Behaviour, AI & Society, Connection Science, and the International Journal of Synthetic Emotions. Joanna holds a degree in behavioural science (nonclinical psychology) from Chicago, an MSc in artificial intelligence and an MPhil in psychology from Edinburgh, and a PhD in artificial intelligence from MIT.

Presentations

The real project of AI ethics Keynote

AI has been with us for hundreds of years; there's no "singularity" step change. Joanna Bryson explains that the main threat of AI is not that it will do anything to us but what we are already doing to each other with it—predicting and manipulating our own and others' behavior.

Brandon Bunker is the senior director of artificial intelligence at Vivint, where he and his team have developed the world’s first smart home assistant that truly understands home occupancy, helping Vivint’s customers save money, energy, and time. In the past year, he scaled Vivint’s Smart Assistant from 0 to 700,000+ customers and won an editor’s choice award at CES. Brandon is passionate about using new tools and techniques to create value from data. His specialties include the IoT, big data, data science, online marketing, direct marketing analytics, social analytics, mobile analytics, segmentation, and web analytics.

Presentations

How Vivint Smart Home made home security and automation even smarter with Tableau (sponsored by Tableau) Session

Brandon Bunker explains how Vivint delivers fast analytics from big data on a bootstrap budget by leveraging Tableau as a strategic piece of its modern BI architecture. By interactively analyzing data as it lands in its Cloudera Hadoop data lake, Vivint is able to deliver security across homes and data alike, making smart homes even smarter and saving customers money in the process.

Ellsworth (Ells) Campbell is a health scientist in the Laboratory Branch of the Division of HIV/AIDS Prevention at the CDC. Ells began working at the CDC as a PhD student and Oak Ridge Institute for Science and Education (ORISE) fellow and recently transitioned to a full-time associate service fellowship. Ells holds bachelor’s and master’s degrees in biology from UC San Diego and is currently pursuing a PhD in biology at Penn State University.

Presentations

Tracking the opioid-fueled HIV outbreak with big data (sponsored by Trifacta) Session

Ells Campbell, Connor Carreras, and Ryan Weil explain how the Microbial Transmission Network Team (MTNT) at the Centers for Disease Control (CDC) is leveraging new techniques in data collection, preparation, and visualization to advance the understanding of the spread of HIV/AIDS.

Marc Carlson is a lead computational biologist in research informatics at Seattle Children’s Research Institute. Marc divides his time between helping architect new cloud-based infrastructure to serve the scientists at SCRI, working to make sure that new compute resources are brought online and properly configured for immediate utility, and helping users with their data and analysis needs via the Bioinformatics unit, the goal of which is to make sure that scientists at SCRI can learn the most from their data. Marc’s contributions include creating and running training courses, periodic consultations, and helping with the Bioinformatics user group. Previously, he held a postdoc in computational biology at UCLA and worked on the Bioconductor core team at the Fred Hutchinson Cancer Research Center, where he served the needs of the R-based computational biology community. Marc holds a BS in genetics and cell biology from Washington State University and a PhD in developmental and cell biology from UC Irvine.

Presentations

Project Rainier: Saving lives one insight at a time Session

Marc Carlson and Sean Taylor offer an overview of Project Rainier, which leverages the power of HDFS and the Hadoop and Spark ecosystem to help scientists at Seattle Children’s Research Institute quickly find new patterns and generate predictions that they can test later, accelerating important pediatric research and increasing scientific collaboration by highlighting where it is needed.

Connor Carreras is Trifacta’s manager for customer success in the Americas, where she helps customers use cutting-edge data wrangling techniques in support of their big data initiatives. Connor brings her prior experience in the data integration space to help customers understand how to adopt self-service data preparation as part of an analytics process. She is a coauthor of the O’Reilly book Principles of Data Wrangling.

Presentations

Tracking the opioid-fueled HIV outbreak with big data (sponsored by Trifacta) Session

Ells Campbell, Connor Carreras, and Ryan Weil explain how the Microbial Transmission Network Team (MTNT) at the Centers for Disease Control (CDC) is leveraging new techniques in data collection, preparation, and visualization to advance the understanding of the spread of HIV/AIDS.

Michelle Casbon is director of data science at Qordoba. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on textual datasets. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her writing has been featured in the AI section of O’Reilly Radar. Michelle holds a master’s degree from the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation.

Presentations

How machine learning with open source tools helps everyone build better products Session

Michelle Casbon explains the machine learning and natural language processing that enables teams to build products that feel native to every user and explains how Qordoba is tackling the underserved domain of localization using open source tools, including Kubernetes, Docker, Scala, Apache Spark, Apache Cassandra, and Apache PredictionIO (incubating).

Tanya Cashorali is the founding partner of TCB Analytics, a Boston-based data consultancy, and the chief data officer of Stattleship, a sports content and data business that connects brands with sports fans through social media. Previously, she worked as a data scientist at Biogen. Tanya started her career in bioinformatics and has applied her experience to other data-rich verticals such as telecom, finance, and sports. She brings over 10 years of experience using R in data scientist roles as well as managing and training data analysts, and she’s helped grow a handful of Boston startups.

Presentations

How to hire and test for data skills: A one-size-fits-all interview kit Session

Given the recent demand for data analytics and data science skills, adequately testing and qualifying candidates can be a daunting task. Interviewing hundreds of individuals of varying experience and skill levels requires a standardized approach. Tanya Cashorali explores strategies, best practices, and deceptively simple interviewing techniques for data analytics and data science candidates.

Sarah Catanzaro is an investor at Canvas Ventures, where she focuses on analytics, data infrastructure, and machine intelligence. Sarah has several years of experience in developing data acquisition strategies and leading machine and deep learning-enabled product development at organizations of various sizes. Most recently, she led the data team at Mattermark to collect and organize information on over one million private companies. Previously, she implemented analytics solutions for municipal and federal agencies as a consultant at Palantir and as an analyst at Cyveillance. She also led projects on adversary behavioral modeling and Somali pirate network analysis as a program manager at the Center for Advanced Defense Studies. Sarah holds a BA in international security studies from Stanford University.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Simon Chan is a senior director of product management for Salesforce Einstein, where he oversees platform development and delivers products that empower anyone to build smarter apps with Salesforce. Simon is a product innovator and serial entrepreneur with more than 14 years of global technology management experience in London, Hong Kong, Guangzhou, Beijing, and the Bay Area. Previously, Simon was the cofounder and CEO of PredictionIO, a leading open source machine learning server (acquired by Salesforce). Simon holds a BSE in computer science from the University of Michigan, Ann Arbor, and a PhD in machine learning from University College London.

Presentations

The journey to Einstein: Building a multitenancy AI platform that powers hundreds of thousands of businesses Session

Salesforce recently released Einstein, which brings AI into its core platform to power every business. The secret behind Einstein is an underlying platform that accelerates AI development at scale for both internal and external data scientists. Simon Chan shares his experience building this unified platform for a multitenancy, multibusiness cloud enterprise.

Karim Chine is a London-based software architect and entrepreneur and the author and designer of RosettaHUB. Previously, he held positions within academic research laboratories and industrial R&D departments, including Imperial College London, EBI, IBM, and Schlumberger. Karim’s interests include large-scale distributed software design, cloud computing applications in research and education, open source software ecosystems, and open science. Since 2009, he has collaborated with the European Commission as an independent expert for the research e-infrastructure program and for the future and emerging technologies program. He has also served as an evaluator and a reviewer of many of EU’s flagship projects related to grids, desktop grids, scientific clouds, and science gateways. Karim holds degrees from Ecole Polytechnique and Telecom ParisTech.

Presentations

rosettaHUB: A global hub for reproducible and collaborative data science Session

Karim Chine offers an overview of rosettaHUB—which aims to establish a global open data science metacloud centered on usability, reproducibility, auditability, and shareability—and shares the results of the rosettaHUB/AWS Educate initiative, which involved 30 higher education institutions and research labs and over 3,000 researchers, educators, and students.

Michael Chui is a San Francisco-based partner in the McKinsey Global Institute, where he directs research on the impact of disruptive technologies, such as big data, social media, and the internet of things, on business and the economy. Previously, as a McKinsey consultant, Michael served clients in the high-tech, media, and telecom industries on multiple topics. Prior to joining McKinsey, he was the first chief information officer of the City of Bloomington, Indiana, and was the founder and executive director of HoosierNet, a regional internet service provider. Michael is a frequent speaker at major global conferences and his research has been cited in leading publications around the world. He holds a BS in symbolic systems from Stanford University and a PhD in computer science and cognitive science and an MS in computer science, both from Indiana University.

Presentations

Executive Briefing: Artificial intelligence Session

Executive Briefing from Michael Chui

Eric Colson is chief algorithms officer at Stitch Fix as well as an advisor to several big data startups. Previously, Eric was vice president of data science and engineering at Netflix. He holds a BA in economics from SFSU, an MS in information systems from GGU, and an MS in management science and engineering from Stanford.

Presentations

Differentiating by data science Session

While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why this means companies need to think very differently about the role and placement of data science in the organization.

Riccardo Gianpaolo Corbella is a Milan-based consulting big data engineer at Data Reply IT, where he develops effective big data solutions based on open source technologies. Riccardo is interested in data mining and distributed systems. He holds a BSc and an MSc in computer science from the Università degli Studi di Milano.

Presentations

How an Italian company rules the world of insurance: Facing the technological challenges of turning data into value Session

With more than 4.5 million black boxes, Italian car insurance has the most telematics clients in the world. Riccardo Corbella and Beniamino Del Pizzo explore the data management challenges that occur in a streaming context when the amount of data to process is gigantic and share a data management model capable of providing the scalability and performance needed to support massive growth.

George Corugedo is a cofounder and chief technology officer at RedPoint Global, where he is responsible for directing the development of the RedPoint Customer Engagement Hub, RedPoint’s leading enterprise customer engagement solution. A former math professor and seasoned technology executive, George has more than two decades of business and technical experience. He left academia to cofound Accenture’s Customer Insights practice, which specializes in strategic data utilization, analytics, and customer strategy. Previously, he was also director of client delivery at ClarityBlue, a provider of hosted customer intelligence solutions, and COO and CIO of Riscuity, a receivables management company that specializes in using analytics to drive collections.

Presentations

Using real-time machine learning and big data to drive customer engagement and digital transformation (sponsored by RedPoint Global) Session

Driving digital transformation is a vital component of continued organizational success and more personalized customer engagement. The best results will come from operationalizing data to automate decisions with machine learning. George Corugedo explains how RedPoint’s customers use connected enterprise data, machine learning, and analytics to impact their businesses.

Dustin Cote is a customer operations engineer at Confluent. Over his career, Dustin has worked in a variety of roles from Java developer to operations engineer. His most recent focus is distributed systems in the big data ecosystem, with Apache Kafka being his software of choice.

Presentations

Mistakes were made, but not by us: Lessons from a year of supporting Apache Kafka Session

Dustin Cote shares his experience troubleshooting Apache Kafka in production environments and explains how to avoid pitfalls like message loss or performance degradation in your environment.
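
One recurring message-loss pitfall is producer configuration. The sketch below uses the kafka-python client to show durability-oriented settings (not necessarily the presenter’s exact recommendations); the broker and topic names are hypothetical.

```python
from kafka import KafkaProducer  # pip install kafka-python

# Durability-oriented settings: wait for all in-sync replicas to acknowledge
# each write, and retry transient broker errors instead of dropping records.
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    acks="all",   # leader plus all in-sync replicas must confirm the write
    retries=5,    # retry transient failures rather than losing the message
)

future = producer.send("payments", b"event-payload")  # hypothetical topic
record_metadata = future.get(timeout=10)  # block until acked; surfaces errors
producer.flush()
```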

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata + Hadoop World conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Findata welcome Tutorial

Alistair Croll and Rob Passarella welcome you to Findata Day.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Bradford Cross is a founding partner at DCVC, a leading machine learning and big data venture capital fund. Previously, Bradford founded Prismatic, which used machine learning for personalized content ranking and natural language processing for topic classification, and FlightCaster, which used machine learning to predict the real-time state of the global air traffic network using FAA, carrier, and weather data. A hedge fund investor and a venture investor, Bradford started his career working on statistical value and momentum strategies at O’Higgins Asset Management and was a founding partner of Data Collective. He was also a systems engineer and worked on distributed systems at Google. Bradford studied computer engineering and finance at Virginia Tech and mathematics at Berkeley.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

How machine learning is used in fintech Findata

Bradford Cross offers an overview of machine learning applications in the financial services sector across banking, insurance, investments, real estate, and consumer financial services and contrasts these approaches with traditional quant finance.

Michael Crutcher is the director of product management at Cloudera, where he is responsible for the direction of Cloudera’s storage products, which include HDFS, HBase, and Parquet. He’s also responsible for managing strategic partnerships with storage vendors.

Presentations

The sunset of lambda: New architectures amplify IoT impact Session

A long time ago in a data center far, far away, we deployed complex lambda architectures as the backbone of our IoT solutions. Though hard to build and maintain, they enabled the collection of real-time sensor data and slightly delayed analytics. Michael Crutcher and Ryan Lippert explain why Apache Kudu, a relational storage layer for fast analytics on fast data, is the key to unlocking the value in IoT data.

Nick Curcuru is vice president of the enterprise information management practice at Mastercard, where he is responsible for leading a team that works with organizations to generate revenue through smart data, architect next-generation technology platforms, and protect data assets from cyberattacks by leveraging Mastercard’s information technology and information security resources and creating peer-to-peer collaboration with its clients. Nick brings over 20 years of global experience successfully delivering large-scale advanced analytics initiatives for such companies as the Walt Disney Company, Capital One, Home Depot, Burlington Northern Railroad, Merrill Lynch, Nordea Bank, and GE. He frequently speaks on big data trends and data security strategy at conferences and symposiums, has published several articles on security, revenue management, and data security, and has contributed to several books on the topic of data and analytics.

Presentations

Architecting security across the enterprise: Instilling confidence and stewardship every step of the way Session

Cybersecurity is now a topic in the boardroom, as organizations are scrambling to increase their security posture. To decrease breach threats, Mastercard brings data security into its system design process. Nick Curcuru shares best practices and lessons learned protecting 160 million transactions per hour over Mastercard's network and securing 16+ petabytes of data at rest.

Paul Curtis is a principal solutions engineer at MapR, where he provides pre- and postsales technical support to MapR’s worldwide Systems Engineering team. Previously, Paul was senior operations engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks, and advertisers; systems manager for Spiral Universe, a company providing school administration software as a service; a senior support engineer at Sun Microsystems; an enterprise account technical manager for both Netscape and FileNet; and an application developer at Applix, IBM Service Bureau, and Ticketron. Paul got started in the ancient personal computing days; he began his first full-time programming job on the day the IBM PC was introduced.

Presentations

Why containers and microservices need streaming data Session

A microservices architecture benefits from the agility of containers for convenient, predictable deployment of applications, while persistent, performant message streaming makes both work better. Paul Curtis explores these infrastructure components and discusses the design of highly scalable real-world systems that take advantage of this powerful triad.

Shannon Cutt is the development editor in the data practice area at O’Reilly Media.

Presentations

Data 101 welcome Data 101

Shannon Cutt welcomes you to the Data 101 tutorial.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Brian Dalessandro is the head of data science at Zocdoc, an online doctor marketplace and booking tool, and an adjunct professor for NYU’s Center for Data Science graduate program. Previously, Brian was vice president of data science at online advertising firm Dstillery. A veteran data scientist and leader with over 15 years of experience developing machine learning-driven practices and products, Brian holds several patents and has published dozens of peer-reviewed articles on the subjects of causal inference, large-scale machine learning, and data science ethics. Brian is also the drummer for the critically acclaimed indie rock band Coastgaard.

Presentations

Challenges in using machine learning to direct healthcare services Session

Zocdoc is an online marketplace that allows easy doctor discovery and instant online booking. However, dealing with healthcare involves many constraints and challenges that render standard approaches to common problems infeasible. Brian Dalessandro surveys the various machine learning problems Zocdoc has faced and shares the data, legal, and ethical constraints that shape its solution space.

Atul Dalmia is vice president of global information management at American Express, where he is responsible for leading the company’s data and platform strategy and driving innovation in acquisition, marketing, and servicing across the customer lifecycle and across channels. He is also responsible for accelerating development on AXP’s big data platform to drive innovation and speed to market while driving cost efficiencies for the enterprise. Atul holds a master’s degree from Massachusetts Institute of Technology and a bachelor’s degree from the Indian Institute of Technology, Chennai.

Presentations

Enterprise digital transformation using big data Session

Big data decisioning is critical to driving real-time business decisions in our digital age. But how do you begin the transformation to big data? The key is enterprise adoption across a variety of end users. Atul Dalmia shares best practices learned from American Express's five-year journey, the biggest challenges you’ll face, and ideas on how to solve them.

Satish Varma Dandu is a data science and engineering manager at NVIDIA, where he leads teams that build massive end-to-end big data and deep learning platforms, handling billions of events per day for real-time analytics, data warehousing, and AI platforms that use deep learning to improve the experience of millions of users. Previously, Satish led data engineering teams at both startups and large public companies. His interests include building large-scale engineering platforms, big data engineering, GPU data acceleration, and deep learning. Satish holds an MS in computer science from the University of Houston and is currently pursuing a management program at Stanford University.

Presentations

Training a deep learning risk detection platform Session

Joshua Patterson and Michael Balint explain how to bootstrap a deep learning framework to detect risk and threats in production operational systems using best-of-breed GPU-accelerated open source tools.

Shirshanka Das is the architect for LinkedIn’s Data Analytics Infrastructure team. Shirshanka was one of the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. His current focus at LinkedIn includes all things Hadoop, high-performance distributed OLAP engines, large-scale data ingestion, transformation and movement, and data lineage and discovery.

Presentations

Taming the ever-evolving compliance beast: Lessons learned at LinkedIn Session

Shirshanka Das and Tushar Shanbhag explore the big data ecosystem at LinkedIn and share its journey to preserve member privacy while providing data democracy. Shirshanka and Tushar focus on three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement platform, and a unified data access layer.

Mike Dauber is a general partner at Amplify. Prior to joining Amplify, Mike spent over six years at Battery Ventures, where he led early-stage enterprise investments on the West Coast, including Battery’s investment in a stealth security company that is also in Amplify’s portfolio. Most recently, Mike sat on the boards of Continuuity, Duetto, Interana, and Platfora. Mike previously invested in Splunk and RelateIQ, which was recently acquired by Salesforce. He began his career as a hardware engineer at a startup and later held product, business development, and sales roles at Altera and Xilinx. Mike is a frequent speaker at conferences and is on the advisory boards of both the O’Reilly Strata Conference and SXSW. He was named to Forbes magazine’s 2015 Midas Brink List. Mike holds a BS in electrical engineering from the University of Michigan in Ann Arbor and an MBA from the University of Pennsylvania’s Wharton School.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Gerard de Melo is an assistant professor of computer science at Rutgers University, where he heads a team of researchers working on big data analytics, natural language processing, and web mining. Gerard’s research projects include UWN/MENTA, one of the largest multilingual knowledge bases, and Lexvo.org, an important hub in the web of data. Previously, he was a faculty member at Tsinghua University, one of China’s most prestigious universities, where he headed the Web Mining and Language Technology group, and a visiting scholar at UC Berkeley, where he worked in the ICSI AI group. He serves on the editorial boards of Computational Intelligence, the Journal of Web Semantics, the Springer journal Language Resources and Evaluation, and the Language Science Press TMNLP book series. Gerard has published over 80 papers, with best paper or demo awards at WWW 2011, CIKM 2010, ICGL 2008, and the NAACL 2015 Workshop on Vector Space Modeling, as well as an ACL 2014 best paper honorable mention, a best student paper award nomination at ESWC 2015, and a thesis award for his work on graph algorithms for knowledge modeling. He holds a PhD in computer science from the Max Planck Institute for Informatics.

Presentations

Learning meaning from web-scale big data HDS

How can we exploit the massive amounts of data now available on the web to enable more intelligent applications? Gerard de Melo shares results on applying deep learning techniques to web-scale amounts of data to learn neural representations of language and world knowledge. The resulting resources can be used in Spark to work with text in over 300 languages.

Beniamino Del Pizzo is a big data engineer at Data Reply IT, where he works on data ingest with a focus on Apache Kafka and Spark applications. Beniamino is passionate about big data, streaming applications, distributed computation, and data analysis. He holds a master’s degree in computer engineering; his thesis outlined an evolutionary approach to using Apache Spark with TSK-fuzzy systems for big data.

Presentations

How an Italian company rules the world of insurance: Facing the technological challenges of turning data into value Session

With more than 4.5 million black boxes deployed, Italy’s car insurance market has the most telematics clients in the world. Riccardo Corbella and Beniamino Del Pizzo explore the data management challenges that arise in a streaming context when the amount of data to process is gigantic and share a data management model capable of providing the scalability and performance needed to support massive growth.

Noemi Derzsy is a postdoctoral research associate at the Social Cognitive Network Academic Research Center at Rensselaer Polytechnic Institute, where she uses data sets to analyze, understand, and model complex systems using network science and data science techniques. She’s also a NASA datanaut. Noemi holds a PhD in physics.

Presentations

Topic modeling openNASA data Session

Open data has enabled society to engage in community-based research and has provided government agencies with more visibility and trust from individuals. Noemi Derzsy offers an overview of the openNASA platform and discusses openNASA metadata analysis and tools for applying NLP and topic modeling techniques to understand open government dataset associations.

Stephen Devine is a Seattle-based data engineer at Big Fish Games, where he wrangles events sent from millions of mobile phones through Kafka into Hive. Previously, he did similar things for Xbox One Live Services using proprietary Microsoft technology and worked on several releases of Internet Explorer at Microsoft.

Presentations

Working within the Hadoop ecosystem to build a live-streaming data pipeline Session

Companies are increasingly interested in processing and analyzing live-streaming data. The Hadoop ecosystem includes platforms and software library frameworks to support this work, but these components require correct architecture, performance tuning, and customization. Stephen Devine and Kalah Brown explain how they used Spark, Flume, and Kafka to build a live-streaming data pipeline.

Ewa Ding is a product manager and product designer at Cloudera, where she is responsible for SQL workload optimization solutions, including offloading traditional data warehouse workloads and optimizing Impala and Hive workloads. She also manages the product direction and strategy of Navigator Optimizer (formerly known as Xplain.io). Previously, Ewa held leadership positions driving product strategy and product design for several enterprise SaaS applications, including Xplain.io.

Presentations

Optimizing the data warehouse at Visa Session

At Visa, the process of optimizing the enterprise data warehouse and consolidating data marts by migrating these analytic workloads to Hadoop has played a key role in the adoption of the platform and how data has transformed Visa as an organization. Nandu Jayakumar and Ewa Ding share Visa’s journey along with some best practices for organizations migrating workloads to Hadoop.

Thomas W. Dinsmore is director of product marketing for Cloudera Data Science. Previously, he served as a knowledge expert on the Strategic Analytics team at the Boston Consulting Group; director of product management for Revolution Analytics; analytics solution architect at IBM Big Data Solutions; and a consultant at SAS, PricewaterhouseCoopers, and Oliver Wyman. Thomas has led or contributed to analytic solutions for more than five hundred clients across vertical markets and around the world, including AT&T, Banco Santander, Citibank, Dell, J.C.Penney, Monsanto, Morgan Stanley, Office Depot, Sony, Staples, United Health Group, UBS, and Vodafone. His international experience includes work for clients in the United States, Puerto Rico, Canada, Mexico, Venezuela, Brazil, Chile, the United Kingdom, Belgium, Spain, Italy, Turkey, Israel, Malaysia, and Singapore.

Presentations

Data science at team scale: Considerations for sharing, collaborating, and getting to production Session

Data science alone is easy. Data science with others, whether in the enterprise or on shared distributed systems, requires a bit more work. Tristan Zajonc and Thomas Dinsmore discuss common technology considerations and patterns for collaboration in large teams and for moving machine learning into production at scale.

Leo Dirac is a principal engineer on the Amazon AI team at Amazon Web Services. Previously, he led the engineering team that launched the Amazon Machine Learning service. Leo has a background in physics. He started writing software professionally in the 1980s. In 2012, he became fascinated with deep learning and has been building systems with it ever since.

Presentations

Practical deep learning for understanding images Session

Leo Dirac demonstrates how to apply the latest deep learning techniques to semantically understand images. You'll learn what embeddings are, how to extract them from your images using deep convolutional neural networks (CNNs), and how they can be used to cluster and classify large datasets of images.
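
As a concrete illustration of the general technique the session covers (a minimal sketch, not Leo’s exact setup), the following assumes Keras with ImageNet-pretrained ResNet50 weights; the embed() helper and the choice of model are illustrative only:

    # Extract image embeddings from a pretrained CNN (illustrative sketch).
    import numpy as np
    from keras.applications.resnet50 import ResNet50, preprocess_input
    from keras.preprocessing import image

    # pooling='avg' yields a 2048-dim embedding vector instead of class scores
    model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

    def embed(path):
        img = image.load_img(path, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return model.predict(x)[0]  # shape: (2048,)

Images whose embeddings lie close together tend to be semantically similar, which is what makes clustering and classification on top of these vectors work.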

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions worth millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

GDPR: Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steven Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Mike Driscoll is the founder and CEO of Metamarkets. Previously, Mike spent more than a decade focused on making the most of data to help companies grow and developed data analytics solutions for online retail, life sciences, digital media, insurance, and banking. He also successfully founded and sold two companies: Dataspora, a life science analytics company, and CustomInk, an early pioneer in customized apparel. Mike began his career as a software engineer for the Human Genome Project. He holds an AB in government from Harvard and a PhD in bioinformatics from Boston University.

Presentations

The cognitive design principles of interactive analytics Session

Most analytics tools in use today provide static visuals that don’t reveal the full, real-time picture. Mike Driscoll shows how to take an interactive approach to analytics. From design techniques to discovering new forms of data exploration, he demonstrates how to put the full power of big data into the hands of the people who need it to make key business decisions.

Leigh Drogen is the founder and CEO of Estimize, a crowdsourced financial estimates platform that facilitates a community of independent analysts, including financial professionals, to offer a more accurate view of market expectations. Prior to founding Estimize, Leigh ran Surfview Capital, a New York-based quantitative investment management firm trading medium-frequency momentum strategies. He was also an early member of the team at StockTwits, where he worked on product and business development. Leigh started his career as an analyst at Geller Capital, a quantitative investment management firm in New York. He holds a BA from Hunter College with a focus in behavioral economics and war theory. When he’s not staring at rectangular lightboxes, Leigh can be found on the ice rink playing hockey, behind a grill, or off in search of waves to surf around the world.

Presentations

Crowdsourced alpha: The future of investment research Findata

Findata session with Leigh Drogen

Mathieu Dumoulin is a data scientist in MapR Technologies’s Tokyo office, where he combines his passion for machine learning and big data with the Hadoop ecosystem. Mathieu started using Hadoop from the deep end, building a full unstructured data classification prototype for Fujitsu Canada’s Innovation Labs, a project that eventually earned him the 2013 Young Innovator award from the Natural Sciences and Engineering Research Council of Canada. Afterward, he moved to Tokyo with his family, where he worked as a search engineer at a startup and a managing data scientist for a large Japanese HR company, before coming to MapR.

Presentations

State-of-the-art robot predictive maintenance with real-time sensor data Session

Mateusz Dymczyk and Mathieu Dumoulin showcase a working, practical, predictive maintenance pipeline in action and explain how they built a state-of-the-art anomaly detection system using big data frameworks like Spark, H2O, TensorFlow, and Kafka on the MapR Converged Data Platform.

Ted Dunning has been involved with a number of startups; the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member for the Apache ZooKeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data mining and passionate about open source, he is an active participant in the Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

Tensor abuse in the workplace Session

Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).

Sastry Durvasula is vice president and global technology head of information management and digital capabilities at American Express, where he is responsible for leading the IT strategy and transformational development to power the company’s data-driven capabilities and digital products globally and for delivering enterprise-wide analytics and business intelligence platforms and supporting critical risk, fraud, and regulatory demands. Sastry has held numerous IT executive positions at American Express, with a successful track record in driving innovation, leading large-scale transformation programs, and managing global technology investment roadmaps while building high-performing organizations and fostering an agile culture. Sastry is passionate about big data, transformational technology, and inspiring others to reach their potential. Most recently, Sastry led the launch of American Express’s big data platform and the transformation of its enterprise data warehouse. Sastry’s team also led the development of the company’s API strategy and the Sync platform to deliver innovative products, drive social commerce, and launch external partnerships. He is credited with several patents in the information management, payments, mobile, and digital commerce technology space. Sastry plays an active role in leading industry forums and technology executive networks, including the Evanta CDO and HMG CIO Summits. He holds master’s and bachelor’s degrees in engineering, as well as various professional certifications.

Presentations

AI at scale at American Express: Walking the talk Findata

The AI landscape is rapidly evolving, offering a lot of promise...and a lot of hype. Sastry Durvasula and Priya Koul explain how American Express is building an AI ecosystem at scale to unlock differentiated customer experiences and open up new business opportunities.

Mateusz Dymczyk is a Tokyo-based software engineer at H2O.ai, where he works as a researcher on distributed machine learning and NLP projects, including the core H2O platform and Sparkling Water, which integrates H2O and Apache Spark. Previously, he worked at Fujitsu Laboratories. Mateusz loves all things distributed and machine learning and hates buzzwords. In his spare time, he participates in the IT community by organizing, attending, and speaking at conferences and meetups. Mateusz holds an MSc in computer science from AGH UST in Krakow, Poland.

Presentations

State-of-the-art robot predictive maintenance with real-time sensor data Session

Mateusz Dymczyk and Mathieu Dumoulin showcase a working, practical, predictive maintenance pipeline in action and explain how they built a state-of-the-art anomaly detection system using big data frameworks like Spark, H2O, TensorFlow, and Kafka on the MapR Converged Data Platform.

Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative that ingests, streams, transforms, stores, and analyzes big data in near real time. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.

Presentations

End-to-end data discovery and lineage in a heterogeneous big data environment with Apache Atlas and Avro Session

Barbara Eckman offers an overview of Comcast’s streaming data platform, which comprises a variety of ingest, transformation, and storage services. The platform uses Apache Avro schemas to support end-to-end data governance, Apache Atlas for data discovery and lineage, and custom asynchronous messaging libraries to notify Atlas of new data and schema entities and lineage links as they are created.

Bob Eilbacher is the vice president of operations at Caserta. An experienced operations and client services professional, Bob has a successful track record of providing technology solutions and services that uncover analytics insights and drive efficiency across the enterprise. He works directly with clients to develop strategies and implement solutions that transform structured and unstructured data into analytics-driven business insights. He has a strong background in technology and a deep appreciation for finding the right solution. Previously, he held executive roles at Verint and Ness Technologies.

Presentations

Creating a DevOps practice for analytics Session

Building an efficient analytics environment requires a strong infrastructure. Bob Eilbacher explains how to implement a strong DevOps practice for data analysis, starting with the necessary cultural changes that must be made at the executive level and ending with an overview of potential DevOps toolchains.

Amie Elcan is a principal architect in CenturyLink’s Data Network Strategies organization, where her current areas of focus are traffic modeling, application traffic analytics, and data science. Amie has worked in the telecommunications industry for over 20 years, delivering traffic-based assessments that drive optimal network architecture and engineering design decisions.

Presentations

Classification of telecom network traffic: Insight gained using statistical learning on a big data platform DCS

Statistical learning techniques applied to network data provide a comprehensive view of traffic behavior that would not be possible using traditional descriptive statistics alone. Amie Elcan shares an application of the random forest classification method using network data queried from a big data platform and demonstrates how to interpret the model output and the value of the data insight.
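
As a rough sketch of the method described (not Amie’s actual pipeline), the following fits a random forest to a feature table exported from a big data platform; the filename, feature columns, and label are hypothetical:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("traffic_features.csv")  # hypothetical export from the platform
    X = df[["bytes", "packets", "duration", "port"]]  # example flow features
    y = df["app_class"]  # example label, e.g., video/web/voip

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)

    print(clf.score(X_test, y_test))  # held-out accuracy
    # Feature importances are one common way to interpret the model's output
    print(sorted(zip(clf.feature_importances_, X.columns), reverse=True))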

Javier “Xavi” Esplugas is the vice president of IT planning and architecture at DHL Supply Chain, where he has served in a number of roles. Previously, he drove the standardization and innovation agenda in Europe, which included DHL’s vision picking, robotics, and internet of things work. Xavi holds an MSc in computer engineering from Universitat Politècnica de Catalunya in Barcelona.

Presentations

Seeing everything so managers can act on anything: The IoT in DHL Supply Chain operations Session

DHL has created an IoT initiative for its supply chain warehouse operations. Javier Esplugas and Kevin Parent explain how DHL has gained unprecedented insight—from the most comprehensive global view across all locations to a unique data feed from a single sensor—to see, understand, and act on everything that occurs in its warehouses with immersive operational data visualization.

Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Prior to Alluxio, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. Bin holds a PhD in computer science from Carnegie Mellon University.

Presentations

Best practices for using Alluxio with Spark Session

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory for managing data across different storage. Many deployments use Alluxio with Spark because Alluxio helps Spark further accelerate applications. Bin Fan and Gene Pang explain how Alluxio makes Spark more effective and share production deployments of Alluxio and Spark working together.
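
The basic pattern is that, once the Alluxio client jar is on Spark’s classpath, Alluxio is addressed like any Hadoop-compatible filesystem via alluxio:// URIs, so data cached in Alluxio memory can be shared across jobs. A minimal sketch, with placeholder host names and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-demo").getOrCreate()

    # Read from Alluxio (19998 is the default master port)
    df = spark.read.text("alluxio://alluxio-master:19998/data/input.txt")
    counts = df.groupBy("value").count()

    # Write results back through Alluxio so other jobs can reuse them
    counts.write.parquet("alluxio://alluxio-master:19998/data/counts")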

Carson Farmer is lead data scientist at Set, a technology startup focused on building innovative new technologies to help mobile application developers make better use of behavioral data, with a focus on protecting users’ privacy. Carson is also an assistant professor of geocomputation in the Department of Geography at the University of Colorado Boulder, where his research focuses on human mobility and space-time interactions.

Presentations

Learning location: Real-time feature extraction for mobile analytics Session

Location-based data is full of information about our everyday lives, but GPS and WiFi signals create extremely noisy mobile location data, making it hard to extract features, especially when working with real-time data. Carson Farmer and Sander Pick explore new strategies for extracting information from location data while remaining scalable, privacy focused, and contextually aware.

Basil Faruqui is lead solutions manager at BMC, where he leads the development and execution of big data and multicloud strategy for BMC’s Digital Business Automation line of business (Control-M). Basil’s key areas of focus include evangelizing the role automation plays in delivering successful big data projects and advising companies on how to build scalable automation strategies for cloud and big data initiatives. Basil has over 15 years of industry experience in various areas of software research and development, customer support, and knowledge management.

Presentations

Automated data pipelines in hybrid environments: Myth or reality? (sponsored by BMC) Session

Are you building, running, or managing complex data pipelines across hybrid environments spanning multiple applications and data sources? Doing this successfully requires automating dataflows across the entire pipeline, ideally controlled through a single source. Basil Faruqui walks you through a customer journey to automate data pipelines across a hybrid environment.

Avrilia Floratau is a senior scientist at Microsoft’s Cloud and Information Services Lab, where her research is focused on scalable real-time stream processing systems. She is also an active contributor to Heron, collaborating with Twitter. Previously, Avrilia was a research scientist at IBM Research working on SQL-on-Hadoop systems. She holds a PhD in data management from the University of Wisconsin-Madison.

Presentations

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and insights on how to address them.

Parisa Foster is cofounder and president of mobile prediction game and brand engagement platform Play The Future. Previously, Parisa was the vice president of marketing and business development at Budge, a leading mobile game studio. A pioneer in the mobile space, Parisa began her career in business intelligence at Airborne Mobile, one of Canada’s first mobile startups, before joining the first Mobile and API team at Yellow Pages Group, transforming digital at Just for Laughs, and consulting for a number of international clients.

Presentations

Using data to play (and forecast) the future DCS

Technology startup Play The Future developed a mobile prediction game in which users predict trending events and get rewarded for accuracy. Parisa Foster explains Play The Future’s unique predictive gameplay, discusses the challenges of a groundbreaking project, and reveals emerging insights derived from its data about how people make predictions.

Eugene Fratkin is a director of engineering at Cloudera, leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

A deep dive into running data engineering workloads in AWS Tutorial

Jennifer Wu, Andrei Savu, Vinithra Varadharajan, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

Michael J. Freedman is a professor in the Computer Science department at Princeton University as well as the cofounder and CTO of Timescale, which provides an open source time series database optimized for fast ingest and complex queries. His research broadly focuses on distributed systems, networking, and security. He developed and operates several self-managing systems, including CoralCDN (a decentralized content distribution network) and DONAR (a server resolution system that powered the FCC’s Consumer Broadband Test), both of which serve millions of users daily. Michael’s other research has included software-defined and service-centric networking, cloud storage and data management, untrusted cloud services, fault-tolerant distributed systems, virtual world systems, peer-to-peer systems, and various privacy-enhancing and anticensorship systems. Michael’s work on IP geolocation and intelligence led him to cofound Illuminics Systems, which was acquired by Quova (now part of Neustar). His work on programmable enterprise networking (Ethane) helped form the basis for the OpenFlow/software-defined networking (SDN) architecture. His honors include the Presidential Early Career Award for Scientists and Engineers (PECASE), a Sloan fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award publications. Michael holds a PhD in computer science from NYU’s Courant Institute and both an SB and an MEng degree from MIT.

Presentations

When boring is awesome: Making PostgreSQL scale for time series data Session

Michael Freedman offers an overview of TimescaleDB, a new scale-out database designed for time series workloads that is open source and engineered as a plugin to PostgreSQL. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.
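
The “boring” part is the point: TimescaleDB is driven with ordinary PostgreSQL tooling. Below is a minimal sketch using psycopg2, assuming the timescaledb extension is already installed, with placeholder connection details and a hypothetical conditions table; create_hypertable() is the main Timescale-specific call:

    import psycopg2

    conn = psycopg2.connect("dbname=metrics user=postgres")  # placeholder DSN
    cur = conn.cursor()

    cur.execute("""CREATE TABLE conditions (
                     time TIMESTAMPTZ NOT NULL,
                     device TEXT,
                     temperature DOUBLE PRECISION)""")
    # Turn the plain table into a time-partitioned hypertable
    cur.execute("SELECT create_hypertable('conditions', 'time')")

    # Ingest and analytics remain plain SQL
    cur.execute("INSERT INTO conditions VALUES (now(), 'dev1', 21.5)")
    cur.execute("""SELECT device, avg(temperature)
                   FROM conditions
                   WHERE time > now() - interval '1 day'
                   GROUP BY device""")
    print(cur.fetchall())
    conn.commit()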

Jon Fuller is an application scientist at KNIME, where he works with customers to deploy advanced analytics and help them understand the power of working with cloud resources. Previously, Jon was a postdoctoral researcher at the Heidelberg Institute for Theoretical Studies, where he published several papers on computational biology topics. Jon is a lapsed physicist. He holds a PhD in bioinformatics from the University of Leeds.

Presentations

Deploying deep learning to assist the digital pathologist Session

Jon Fuller and Olivia Klose explain how KNIME, Apache Spark, and Microsoft Azure enable fast and cheap automated classification of malignant lymphoma type in digital pathology images. The trained model is deployed to end users as a web application using the KNIME WebPortal.

Jerrard Gaertner is a senior adviser for big data education at the University of Toronto and the vice president of Managed Analytic Services. Jerrard is a CPA, a security and privacy specialist, a futurist, and an ethicist.

Presentations

What I learned from teaching 1,500 analytics students Session

Engaging, teaching, mentoring, and advising mature, mostly employed, often enthusiastic and ambitious adult learners at the University of Toronto has taught Jerrard Gaertner more about analytics in the real world than he ever imagined. Jerrard shares stories about everything from hyped-up expectations and internal sabotage to organizational streamlining and the creation of transformative insight.

Kaushal is a senior software engineer at Trifacta, where he has worked since February 2015. In addition to building Trifacta’s fast interactive transformation engine (Photon), he has built various data transformation features that improve the utility and usability of the product. Prior to joining Trifacta, Kaushal built prediction and estimation software at NVIDIA. He holds an MS in computer science and engineering.

Presentations

Interactive data exploration and analysis at enterprise scale Session

Sean Kandel shares best practices for building and deploying Hadoop applications to support large-scale data exploration and analysis across an organization.

Eddie Garcia is chief information security officer at Cloudera, a leader in enterprise analytic data management, where he draws on his more than 20 years of information and data security experience to help Cloudera Enterprise customers reduce security and compliance risks associated with sensitive datasets stored and accessed in Apache Hadoop environments. Previously, Eddie was the vice president of infosec and engineering at Gazzang prior to its acquisition by Cloudera; there, he architected and implemented secure and compliant big data infrastructures for customers in the financial services, healthcare, and public sector industries to meet PCI, HIPAA, FERPA, FISMA, and EU data security requirements. He was also the chief architect of the Gazzang zNcrypt product and holds three patents for data security.

Presentations

Machine learning to spot cybersecurity incidents at scale Session

Machine data from firewalls, network switches, DNS servers, and many other devices in your organization may be untapped potential for cybersecurity threat analytics using machine learning. Eddie Garcia explores how companies are using Apache Hadoop-based approaches to protect their organizations and explains how Apache Spot is tackling this challenge head-on.

Yael Garten is director of data science at LinkedIn, where she leads a team that focuses on understanding and increasing growth and engagement of LinkedIn’s 400 million members across mobile and desktop consumer products. Yael is an expert at converting data into actionable product and business insights that impact strategy. Her team partners with product, engineering, design, and marketing to optimize the LinkedIn user experience, creating powerful data-driven products to help LinkedIn’s members be productive and successful. Yael champions data quality at LinkedIn; she has devised organizational best practices for data quality and developed internal data tools to democratize data within the company. Yael also advises companies on informatics methodologies to transform high-throughput data into insights and is a frequent conference speaker. She holds a PhD in biomedical informatics from the Stanford University School of Medicine, where her research focused on information extraction via natural language processing to understand how human genetic variations impact drug response, and an MSc from the Weizmann Institute of Science in Israel.

Presentations

The unspoken challenges of doing data science Session

Data science is a rewarding career. It's also hard. Yael Garten explores what data scientists do, how they fit into the broader company organization, and how they can excel at their trade and shares the hard and soft skills required, challenges to watch out for, and tips and tricks for success and #DataScienceHappiness.

Alison Gilles is director of engineering for data infrastructure at Spotify, where she coaches and leads teams in backend services and data infrastructure. Previously, she led engineering teams at nonprofit organizations in education and corporate social responsibility.

Presentations

Spotify in the cloud: The next evolution of data at Spotify Session

In early 2016, Spotify decided that it didn’t want to be in the data center business. The future was the cloud. Josh Baer and Alison Gilles share Spotify's story and explain what it takes to move to the cloud, covering Spotify's technology choices, challenges faced, and the lessons Spotify learned along the way.

Daniel Goddemeyer is the founder of OFFC NYC, a New York City-based research and design studio that works with global brands, research institutions, and startups to explore future product applications for today’s emerging technologies. Daniel’s research explores how the increasing proliferation of these technologies in our future lives will transform our everyday interactions. His work has been exhibited internationally at the Westbound Shanghai Architecture Biennial, the Data in the 21st Century exhibition at V2 Rotterdam, Data Traces Riga, and the Big Bang Data exhibition at London’s Somerset House, among others, and he has won or received recognition from the Art Directors Club, the Red Dot Award, the German Design Prize, the Kantar Information Is Beautiful Awards, and the Industrial Designers Society of America.

Presentations

Data futures: Exploring the everyday implications of increasing access to our personal data Session

Increasing access to our personal data raises profound moral and ethical questions. Daniel Goddemeyer and Dominikus Baur share the findings from Data Futures, an MFA class in which students observed each other through their own data, and demonstrate the results with a live experiment with the audience that showcases some of the effects when personal data becomes accessible.

Brian Granger is an associate professor of physics and data science at Cal Poly State University in San Luis Obispo. Brian is a leader of the IPython project, cofounder of Project Jupyter, and an active contributor to a number of other open source projects focused on data science in Python. Recently, he cocreated the Altair package for statistical visualization in Python. He is an advisory board member of NumFOCUS and a faculty fellow of the Cal Poly Center for Innovation and Entrepreneurship.

Presentations

JupyterLab: Building blocks for interactive computing Session

With JupyterLab—an extensible IDE-like web application for data science and computation that is the next generation of the popular Jupyter Notebook—users compute with multiple notebooks, editors, and consoles that work together in a tabbed layout. Brian Granger offers an overview of JupyterLab and demonstrates how to use third-party plugins to extend and customize many aspects of JupyterLab.

Jonathan Gray is the founder and CEO of Cask. Jonathan is an entrepreneur and software engineer with a background in startups, open source, and all things data. Previously, he was a software engineer at Facebook, where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production. An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase. He is now a core contributor and active committer in the community. Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.

Presentations

Hybrid data lakes: Unlocking the inevitable (sponsored by Cask) Session

To take advantage of the latest big data technology options in the cloud, more and more enterprises are building hybrid, self-service data lakes. Jonathan Gray discusses the importance of a portability strategy, addresses implementation challenges, and shares customer use cases that will inspire enterprises to embark on a multi-environment data lake journey.

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry, and he has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data at various national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Ted Malaska, Gwen Shapira, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Nadeem Gulzar is the head of advanced analytics and architecture at Danske Bank Group, a Nordic bank with strong roots in Denmark and a focus on becoming the most trusted financial partner in the Nordics. Nadeem has taken the lead in establishing advanced analytics and big data technologies within Danske. Previously, he worked with credit and market risk, where he headed a program to build up capabilities for calculating risk using Monte Carlo simulation methods. Nadeem holds a BS in computer science, mathematics, and psychology and a master’s degree in computer science, both from Copenhagen University.

Presentations

Fighting financial fraud at Danske Bank with artificial intelligence Session

Fraud in banking is an arms race with criminals using machine learning to improve their attack effectiveness. Ron Bodkin and Nadeem Gulzar explore how Danske Bank uses deep learning for better fraud detection, covering model effectiveness, TensorFlow versus boosted decision trees, operational considerations in training and deploying models, and lessons learned along the way.

Alexandra Gunderson is a data scientist at Arundo Analytics. Her background is in mechanical engineering and applied numerical methods.

Presentations

IIoT data fusion: Bridging the gap from data to value Session

One of the main challenges when working with industrial data is linking the large amount of data and extracting value. Alexandra Gunderson shares a comprehensive preprocessing methodology that structures and links data from different sources, converting the IIoT analytics process from an unorganized mammoth to one more likely to generate insight.

Sijie Guo is the cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for the messaging group at Twitter, where he cocreated Apache DistributedLog, and worked on push notification infrastructure at Yahoo. He is the PMC chair of Apache BookKeeper.

Presentations

Messaging, storage, or both: The real-time story of Pulsar and Apache DistributedLog Session

Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. Matteo Merli and Sijie Guo offer an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and insights on how to address them.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from conceptual overviews all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.
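
For a sense of the ground covered, here is a minimal softmax classifier on MNIST in the TensorFlow 1.x style current at the time, adapted as a sketch from the library’s introductory material rather than from the presenters’ own code:

    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets("/tmp/mnist", one_hot=True)

    x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
    y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot digit labels
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(x, W) + b

    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
    train = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(1000):
            xs, ys = mnist.train.next_batch(100)
            sess.run(train, {x: xs, y_: ys})
        correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y_, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
        print(sess.run(accuracy, {x: mnist.test.images, y_: mnist.test.labels}))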

Yunsong Guo is a staff engineer at Pinterest developing Homefeed ranking ML models. Yunsong is a founding member of the Homefeed Ranking team and has led key projects to move Pinterest homefeed ranking from time-based ordering to logistic regression and later to GBDT-powered ranking systems. These projects and feature improvements resulted in more than 100% gains in homefeed user engagement. Previously, he spent a few years working in London and Hong Kong on algorithmic trading, high-frequency trading, and statistical arbitrage using machine-learned models. Yunsong holds a PhD in computer science from Cornell University with a focus on machine learning.

Presentations

How Pinterest uses machine learning to achieve ~200M monthly active users HDS

Pinterest has always prioritized user experiences. Yunsong Guo explores how Pinterest uses machine learning—particularly linear, GBDT, and deep NN models—in its most important product, Pinterest Homefeed, to improve user engagement. Along the way, Yunsong shares how Pinterest drastically increased its international user engagement as well as lessons on finding the most impactful features.

Sebastian Gutierrez is a data entrepreneur who focuses on data-driven companies. Sebastian founded DashingD3js.com to provide online and corporate training in data visualization and D3.js to a diverse client base, including corporations like the New York Stock Exchange, American Express, Intel, General Dynamics, Salesforce, Thomson Reuters, Oracle, Bloomberg Businessweek, universities, and dozens of startups. More than 1,000 people have attended his live trainings and many more have succeeded with his online D3.js training. Sebastian also cofounded DataScienceWeekly.org, which provides news, analysis, and commentary in data science. Its Data Science Weekly newsletter reaches tens of thousands of aspiring and professional data scientists. He is also the author of Data Scientist at Work, a collection of interviews with many of the world’s most influential and interesting data scientists from across the spectrum of public companies, private companies, startups, venture investors, and nonprofits. Sebastian holds a BS in mathematics from MIT and an MA in economics from the University of San Francisco.

Presentations

Improve business decision making with the science of human perception Session

You likely already use business metrics and analytics to achieve success in your data-driven organization. Sebastian Gutierrez demonstrates how to use the science of human perception to drastically improve your data visualizations, reports, and dashboards to drive better decisions and results.

Felix GV is a software engineer working on LinkedIn’s data infrastructure. He works on Voldemort and Venice and keeps a close eye on Hadoop, Kafka, Samza, Azkaban, and other systems.

Presentations

Introducing Venice: A derived datastore for batch, streaming, and lambda architectures Session

Companies with batch and stream processing pipelines need to serve the insights they glean back to their users, an often-overlooked problem that can be hard to achieve reliably and at scale. Felix GV and Yan Yan offer an overview of Venice, a new data store capable of ingesting data from Hadoop and Kafka, merging it together, replicating it globally, and serving it online at low latency.

Patrick Hall is a senior data scientist and product engineer at H2O.ai, where he helps customers derive substantive business value from machine learning technologies. His product work at H2O.ai focuses on two important aspects of applied machine learning: model interpretability and model deployment. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Previously, Patrick held global customer-facing and R&D research roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the eleventh person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Interpretable AI: Not just for regulators Session

Interpreting deep learning and machine learning models is not just another regulatory burden to be overcome. People who use these technologies have the right to trust and understand AI. Patrick Hall and Sri Satish share techniques for interpreting deep learning and machine learning models and telling stories from their results.

Eui-Hong (Sam) Han is the director of big data and personalization at the Washington Post. Sam is an experienced practitioner of data mining and machine learning and has an in-depth understanding of analytics technologies. He has successfully applied these technologies to solve real business problems. At the Washington Post, he leads a team building an integrated big data platform to store all aspects of customer profiles and activities from both digital and print circulation, content metadata, and business data. His team is building an infrastructure, tools, and services to provide personalized experience to customers, empower the newsroom with data for better decisions, and provide targeted advertising capability. Previously, he led the Big Data practice at Persistent Systems, started the Machine Learning group in Sears Holdings’s online business unit, and worked for a data mining startup company. Sam’s expertise includes data mining, machine learning, information retrieval, and high-performance computing. He holds a PhD in computer science from the University of Minnesota.

Presentations

Automatic comments moderation with ModBot at the Washington Post Session

The quality of online comments is critical to the Washington Post. However, quality management of the comment section currently requires costly manual resources. Eui-Hong Han and Ling Jiang discuss ModBot, a machine learning-based tool developed for automatic comments moderation, and share the challenges they faced in developing and deploying ModBot into production.

Luke (Qing) Han is the cofounder and CEO of Kyligence, which provides a leading intelligent data platform powered by Apache Kylin to simplify big data analytics from on-premises to the cloud. Luke is the cocreator and PMC chair of Apache Kylin, where he contributes his passion to driving the project’s strategy, roadmap, and product design. For the past few years, Luke has been working on growing Apache Kylin’s community, building its ecosystem, and extending its adoption globally. Previously, he was big data product lead at eBay, where he managed Apache Kylin, engaged customers, and coordinated various teams from different geographical locations, and chief consultant at Actuate China.

Presentations

Building enterprise OLAP on Hadoop in finance with Apache Kylin (sponsored by Kyligence) Session

Luke Han offers an overview of Apache Kylin and its enterprise version KAP and shares a case study of how a top finance company migrated to Apache Kylin on top of Hadoop from its legacy Cognos and DB2 system.

Tom Hanlon is a senior instructor at Skymind, where he delivers courses on the wonders of the Hadoop ecosystem. Before beginning his relationship with Hadoop and large distributed data, he had a happy and lengthy relationship with MySQL with a focus on web operations. He has been a trainer for MySQL, Sun, and Percona.

Presentations

Securely building deep learning models for digital health data Tutorial

Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.

Behrooz Hashemian is a researcher and chief data officer at the Massachusetts Institute of Technology’s Senseable City Lab, where he investigates the innovative implementation of big data analytics and artificial intelligence in smart cities, finance, and healthcare. A data scientist with expertise in developing predictive analytics strategies, machine learning solutions, and data-driven platforms for informed decision making, Behrooz endeavors to bridge the gap between academic research and industrial deployment of big data analytics and artificial intelligence. Behrooz also leads a project on anonymized data fusion, which provides multidimensional insight into urban activities and customer behaviors from multiple sources.

Presentations

Anonymized data fusion: Privacy versus utility Session

People leave an increasing number of digital traces in their everyday lives. Since these traces are mostly anonymized, the information that advanced data analytics can extract is limited to each individual trace. Behrooz Hashemian explains how to fuse various traces and build multidimensional insight by taking advantage of patterns in people's behavior.

Bill Havanki is a software engineer at Cloudera, where he contributes to Hadoop components and systems for deploying Hadoop clusters into public cloud services. Previously, Bill worked for 15 years developing software for government contracts, focusing mostly on analytic frameworks and authentication and authorization systems. He holds a BS in electrical engineering from Rutgers University and an MS in computer engineering from North Carolina State University. A New Jersey native, Bill currently lives near Annapolis, Maryland, with his family.

Presentations

Automating cloud cluster deployment: Beyond the book Session

Speed and reliability in deploying big data clusters are key for effectiveness in the cloud. Drawing on ideas from his book Moving Hadoop to the Cloud, which covers essential practices like baking images and automating cluster configuration, Bill Havanki explains how you can automate the creation of new clusters from scratch and use metrics gathered from the cloud provider to scale up.
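
As one concrete flavor of metric-driven scaling, a controller can poll the cloud provider's monitoring API and react to sustained load. The sketch below uses AWS CloudWatch via boto3 purely as an illustration; the metric, auto scaling group name, and threshold are assumptions, not recommendations from the book:

    from datetime import datetime, timedelta

    import boto3

    # Hypothetical check: has average CPU across the worker group stayed high?
    cloudwatch = boto3.client("cloudwatch")
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "hadoop-workers"}],  # assumed name
        StartTime=datetime.utcnow() - timedelta(minutes=15),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    if points and min(p["Average"] for p in points) > 80.0:
        print("CPU above 80% for 15 minutes; add worker nodes")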

Katherine Heller is an assistant professor in Duke University’s Departments of Statistical Science, Computer Science, and Electrical and Computer Engineering and at the Center for Cognitive Neuroscience, where she develops new methods and models to discover latent structure in data, including cluster structure, using Bayesian nonparametrics, hierarchical Bayes, time series techniques, and other Bayesian statistical methods, and applies these methods to problems in the brain and cognitive sciences, human social interactions, and clinical medicine. Previously, she was an NSF postdoctoral fellow in the Computational Cognitive Science group at MIT and an EPSRC postdoctoral fellow at the University of Cambridge. Katherine has been the recipient of a first-round NSF BRAIN Initiative award, a Google faculty research award, and an NSF CAREER award. She holds a PhD from the Gatsby Unit at University College London.

Presentations

Machine learning for healthcare data HDS

Katherine Heller discusses multiple ways in which healthcare data is acquired and explains how machine learning methods are currently being introduced into clinical settings.

Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic-net regularization in Spark’s ML library and one-pass elastic-net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.

Presentations

Boosting Spark MLlib performance with rich optimization algorithms Session

Recent developments in Spark MLlib have given users the power to express a wider class of ML models and decrease model training times via the use of custom parameter optimization algorithms. Seth Hendrickson and DB Tsai explain when and how to use this new API and walk you through creating your own Spark ML optimizer. Along the way, they also share performance benefits and real-world use cases.
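 
The custom optimizer API itself lives on the Scala side of Spark ML, but the model families it enables are visible from every language binding. For reference, this is roughly how the multinomial, elastic-net-regularized logistic regression mentioned above is invoked from PySpark; the data and parameter values are toy placeholders:

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.getOrCreate()
    # Three classes, two features; purely illustrative data.
    df = spark.createDataFrame(
        [(0.0, Vectors.dense(0.0, 1.1)),
         (1.0, Vectors.dense(2.0, 1.0)),
         (2.0, Vectors.dense(2.0, -1.0))],
        ["label", "features"])

    # elasticNetParam=0.5 mixes L1 and L2 penalties; regParam sets their strength.
    lr = LogisticRegression(family="multinomial", elasticNetParam=0.5, regParam=0.01)
    model = lr.fit(df)
    print(model.coefficientMatrix)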

Extending Spark ML: Adding your own tools and algorithms Session

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau and Seth Hendrickson introduce Spark’s ML pipelines and explain how to extend them with your own custom algorithms. Even if you don't have your own algorithm to add, you'll leave with a deeper understanding of Spark's ML pipelines.
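
For a taste of what extending the pipeline API involves, a custom stage only needs to satisfy the Transformer contract. Below is a minimal PySpark sketch, omitting the Params mixins and persistence support a production-grade stage would add; the UpperCaser stage is hypothetical:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml import Transformer

    class UpperCaser(Transformer):
        """Hypothetical custom stage: upper-cases one string column into another."""

        def __init__(self, inputCol, outputCol):
            super(UpperCaser, self).__init__()
            self.inputCol, self.outputCol = inputCol, outputCol

        def _transform(self, df):
            # transform() on the base class dispatches here; Pipelines call it too.
            return df.withColumn(self.outputCol, F.upper(F.col(self.inputCol)))

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("hello strata",)], ["text"])
    UpperCaser("text", "shout").transform(df).show()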

Lige Hensley is chief technology officer for Ivy Tech Community College of Indiana, where he leads a highly efficient and agile technical staff to bring a competitive advantage to the organization. A 24-year veteran of the IT industry, with experience ranging from successful startup companies to the Fortune 500, Lige has worked in a wide variety of industries, such as agriculture, military, entertainment, logistics, healthcare, government, education, manufacturing, telematics, and many more. He is an alumnus of the Rose-Hulman Institute of Technology and brings a solid engineering background and a passion for innovation to every endeavor.

Presentations

Learning from higher education: How Ivy Tech is using predictive analytics and a data democracy to reverse decades of entrenched practices Session

As the largest community college in the US, Ivy Tech ingests over 100M rows of data a day. Brendan Aldrich and Lige Hensley explain how Ivy Tech is applying predictive technologies to establish a true data democracy—a self-service data analytics environment empowering thousands of users each day to improve operations, achieve strategic goals, and support student success.

JC Herz is cofounder of Ion Channel, a data and microservices platform that automates situational awareness and enables risk management of the software supply chain. She has 15 years of analytics experience in healthcare and national security. JC was a White House Special Consultant to the Pentagon’s CIO office and coauthored the DoD’s Open Technology Development roadmap. A published author, she has been contributing to Wired magazine since 1993.

Presentations

Confounding factors galore: Using software ecosystem data to risk rate code Session

Automating security for DevOps means continuous analysis of open source software dependencies, vulnerabilities, and ecosystem dynamics. But the data is confounding: a flurry of reported vulnerabilities or infrequent commits can be good or bad, depending on a project's scope and lifecycle. JC Herz illuminates nonintuitive insights from the software supply chain.

John Hitchingham is director of performance engineering at FINRA, where he is responsible for driving technical innovation and efficiency across a cloud application portfolio that processes over 75 billion market events per day to detect fraud, market manipulation, insider trading, and abuse. Previously, John worked at both large and boutique consulting firms providing technical design and consulting services to startup, media, and telecommunications clients. John holds a BS in electrical engineering from Rutgers University.

Presentations

Cloud data lakes: Analytic data warehouses in the cloud Session

John Hitchingham shares insights into the design and operation of FINRA's data lake in the AWS Cloud, where FINRA extracts, transforms, and loads over 75B transactions per day. Users can query across petabytes of data in seconds on AWS S3 using Presto and Spark—all while maintaining security and data lineage.

Vincent-Charles Hodder is the cofounder and CEO of Local Logic, an information company providing location insights on cities to help travelers, home buyers, and investors make better, more informed decisions. Vincent is passionate about cities, tech, and how they can work together to change the way we live. He has a background in finance and urban planning and worked in real estate development before starting Local Logic.

Presentations

Mapping cities through data to model risk in retail and real estate Findata

The location characteristics of a retail or real estate development dictate the types of customers it attracts and the customer experience it delivers. Vincent-Charles Hodder explains how to model future demand for specific retail offerings and real estate projects as well as how to target marketing efforts to the most relevant locations in the city based on specific customer profiles.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

What can we learn from 750 billion GitHub events and 42 TB of code? Session

With Google BigQuery, anyone can easily analyze more than five years of GitHub metadata and 42+ terabytes of open source code. Felipe Hoffa explains how to leverage this data to understand the community and code related to any language or project. Relevant for open source creators, users, and choosers, this is data that you can leverage to make better choices.
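
For a flavor of the analysis, the sketch below runs a classic query over the public github_repos dataset with the BigQuery Python client; the query itself is just an example, not one from the talk:

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes GCP credentials are already configured
    query = """
        SELECT l.name AS language, SUM(l.bytes) AS total_bytes
        FROM `bigquery-public-data.github_repos.languages`, UNNEST(language) AS l
        GROUP BY language
        ORDER BY total_bytes DESC
        LIMIT 10
    """
    # Print the ten languages with the most bytes of code on GitHub.
    for row in client.query(query).result():
        print(row.language, row.total_bytes)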

Carla Holtze is the CEO and cofounder of Parrable, a digital identification SaaS company. Previously, Carla worked at the BBC, the Economist Intelligence Unit, and Lehman Brothers in both New York and Hong Kong. She holds an MBA from Columbia Business School, an MS in journalism from Columbia University, and a BS from Northwestern University. Carla serves on the advisory board of the San Francisco Symphony Soundbox and Symphonix and mentors emerging entrepreneurs and technology companies through Startup Mexico (SUM).

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

John Horcher is the CRO of Virtual Cove. John has extensive financial markets experience in trading, investment banking, and analyst roles. Previously, he held senior-level roles with firms including SunGard, Business Intelligence Advisors, TIM Group, EDS, and Intergraph and served as managing director of Halpern Capital, where he drove the investor base for research sales and investment banking opportunities, which included raising over $300 million in equity and debt.

Presentations

Discovering insights in financial data with immersive reality Session

Immersive reality enables powerful new information design concepts and, most importantly, the telling of more powerful, insightful stories. John Horcher explores how immersive reality deployments in financial markets have enabled quicker time to insight and therefore better decision making.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and currently spends a lot of time writing a book, Stream Processing with Apache Flink.

Presentations

Stream analytics with SQL on Apache Flink Session

Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage.

Christine Hung leads the Data Solutions team at Spotify, which collaborates with business groups across the company to build scalable analytics solutions and provide strategic business insights. Previously, Christine ran the Data Science and Engineering team at the New York Times, where her team partnered closely with the newsroom to build audience development tools and predictive algorithms to drive performance; she was also head of sales analytics for iTunes at Apple and a business analyst at McKinsey & Company. Christine grew up in Taiwan and currently lives in Manhattan with her family. She holds an MBA from Stanford Business School.

Presentations

Music, the window into your soul Keynote

Have you ever wondered why Spotify just seems to know what you want? As a data-first company, Spotify is investing heavily in its analytics and machine learning capabilities to understand and predict user needs. Christine Hung shares how Spotify uses data and algorithms to improve user experience and drive business impact.

Alysa Z. Hutnik is a partner at Kelley Drye & Warren LLP in Washington, DC, where she delivers comprehensive expertise in all areas of privacy, data security, and advertising law. Alysa’s experience ranges from counseling to defending clients in FTC and state attorneys general investigations, consumer class actions, and commercial disputes. Much of her practice focuses on the digital and mobile space, including cloud, mobile payment, calling and texting practices, and big data-related services. Ranked as a leading practitioner in the privacy and data security area by Chambers USA, Chambers Global, and Law360, Alysa has received accolades for the dedicated and responsive service she provides to clients. The US Legal 500 notes that she provides “excellent, fast, efficient advice” regarding data privacy matters. In 2013, she was one of just three attorneys under 40 practicing in the area of privacy and consumer protection law to be recognized as a rising star by Law360.

Presentations

Executive Briefing: Legal best practices for making data work Session

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik shares legal best practices and practical tips to avoid becoming a big data “don’t.”

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of the ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Solving data cleaning and unification using human-guided machine learning Session

Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics can collectively deliver a scalable, high-accuracy solution.

Pramod Immaneni is a PMC member of Apache Apex and lead architect at DataTorrent, where he works on the Apex platform and specializes in big data applications. Prior to DataTorrent, he founded several technology startups. He was CTO of Leaf Networks, a company he cofounded that was later acquired by Netgear; there he built products in the core networking space and earned patents in peer-to-peer VPNs. Before that, he helped start a company where he architected a dynamic content customization engine for mobile devices.

Presentations

Building a scalable streaming ingestion application with exactly once semantics using Apache Apex Session

Apache Apex is an open source stream processing platform that runs on Hadoop. Apex is commonly used for big data ingestion, streaming analytics, ETL, fast batch, real-time actions, and threat detection. Pramod Immaneni walks you through building an ingestion application with lightweight ETL that is scalable and fault tolerant and offers exactly-once semantics.

Exactly once, more than once: Apache Kafka, Heron, and Apache Apex Session

In a series of three 11-minute presentations, key members of Apache Kafka, Heron, and Apache Apex discuss their respective implementations of exactly once delivery and semantics.

Marta Jamrozik is CEO of Claire. Marta is an expert in product and price testing. Previously, she created the Strategic Pricing department at Fortune 100 CPG company Mondelēz, which was responsible for new product introductions and pricing across 165 countries, and worked in management consulting at Bain & Company. Marta was pursuing an MBA in Stanford’s Graduate School of Business before dropping out to work full-time on Claire.

Presentations

Retail's panacea: How machine learning is driving product development Session

In a panel discussion, Karen Moon, Jared Schiffman, and Marta Jamrozik explore how the retail industry is embracing data to include consumers in the design and development process, how it is tackling the challenges posed by the wealth of sources and the unstructured nature of the data it handles, and how that data is turned into digestible, actionable insights.

Nandu Jayakumar is a software architect and engineering leader at Visa, where he is currently responsible for the long-term architecture of data systems and leads the data platform development organization. Previously, as a senior leader of Yahoo’s well-regarded data team, Nandu built key pieces of Yahoo’s data processing tools and platforms over several iterations, which were used to improve user engagement on Yahoo websites and mobile apps. He also designed large-scale advertising systems and contributed code to Shark (SQL on Spark) during his time there. Nandu holds a bachelor’s degree in electronics engineering from Bangalore University and a master’s degree in computer science from Stanford University, where he focused on databases and distributed systems.

Presentations

Optimizing the data warehouse at Visa Session

At Visa, the process of optimizing the enterprise data warehouse and consolidating data marts by migrating these analytic workloads to Hadoop has played a key role in the adoption of the platform and how data has transformed Visa as an organization. Nandu Jayakumar and Ewa Ding share Visa’s journey along with some best practices for organizations migrating workloads to Hadoop.

Ling Jiang is a data scientist at the Washington Post, where she works on data mining and knowledge discovery from large volumes of data and has successfully built several data-powered products using machine learning and NLP techniques. Ling is skilled in using various machine learning and data mining techniques to tackle business problems. She holds a PhD in information science from Drexel University.

Presentations

Automatic comments moderation with ModBot at the Washington Post Session

The quality of online comments is critical to the Washington Post. However, the quality management of the comment section currently requires costly manual resources. Eui-Hong Han and Ling Jiang discuss ModBot, a machine learning-based tool developed for automatic comment moderation, and share the challenges they faced in developing and deploying ModBot into production.

Dave Kale is a PhD candidate in computer science and an Alfred E. Mann Innovation in Engineering fellow at the University of Southern California. His research uses machine learning to extract insight from digital data in high-impact domains, including, but not limited to, healthcare. His primary interest is in developing robust methods for learning meaningful representations of multivariate time series, especially using deep learning and related techniques. Dave is advised by Greg Ver Steeg of the USC Information Sciences Institute. He holds a BS in symbolic systems and an MS in computer science from Stanford University. Dave helps organize the annual Meaningful Use of Complex Medical Data (MUCMD) Symposium and is a cofounder of Podimetrics.

Presentations

Securely building deep learning models for digital health data Tutorial

Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.

Joseph has over ten years of teaching experience and over five years of experience in data science and analytics. He has taught in over a dozen countries around the world and been featured on Japanese television and in Saudi newspapers. He holds a BS in electrical and computer engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. Prior to joining Databricks, Joseph was an instructor with Cloudera and a technical sales engineer with IBM. He is a rabid Arsenal FC supporter and competitive Magic: The Gathering player. He lives with his wife and daughter in Needham, MA.

Presentations

Apache Spark for machine learning and data science 2-Day Training

In this training course, you'll learn how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the APIs and algorithms available in the Spark MLlib DataFrames API.
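
For a sense of the level of the material, a minimal ML pipeline in the Spark MLlib DataFrames API looks like the following sketch (toy data; the stage choices are illustrative, not the course's actual labs):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.getOrCreate()
    train = spark.createDataFrame(
        [("spark is fast", 1.0), ("the weather is nice", 0.0)], ["text", "label"])

    # Chain feature extraction and a classifier into one reusable estimator.
    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="features"),
        LogisticRegression(maxIter=10),
    ])
    model = pipeline.fit(train)
    model.transform(train).select("text", "prediction").show()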

Supun Kamburugamuve is a PhD candidate in computer science at Indiana University, where he researches big data applications and frameworks with a focus on data streaming for real-time data analytics. Recently, he has been working on high-performance enhancements to big data systems with HPC interconnects such as InfiniBand and Omni-Path. Supun is an Apache Software Foundation member and has contributed to many open source projects, including Apache Web Services projects. Before joining Indiana University, Supun worked on middleware systems and was a key member of a team developing an open source enterprise service bus that is widely used for enterprise integrations.

Presentations

Low-latency streaming: Twitter Heron on Infiniband Session

Modern enterprises are data driven and want to move at light speed. To achieve real-time performance, financial applications use streaming infrastructures for low latency and high throughput. Twitter Heron is an open source streaming engine with latencies around 14 ms. Karthik Ramasamy and Supun Kamburugamuve explain how they ported Heron to InfiniBand to achieve latencies as low as 7 ms.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Interactive data exploration and analysis at enterprise scale Session

Sean Kandel shares best practices for building and deploying Hadoop applications to support large-scale data exploration and analysis across an organization.

Daniel Kang is a PhD student in the Stanford InfoLab, where he is supervised by Peter Bailis and Matei Zaharia. Daniel’s research interests lie broadly at the intersection of machine learning and systems. Currently, he is working on deep learning applied to video analysis.

Presentations

NoScope: Querying videos 1,000x faster with deep learning HDS

Video is one of the fastest-growing sources of data with rich semantic information, and advances in deep learning have made it possible to query this information with near-human accuracy. However, inference remains prohibitively expensive: even the most powerful GPU cannot run state-of-the-art models in real time. Daniel Kang offers an overview of NoScope, which runs queries over video 1,000x faster.
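
NoScope's core idea is to avoid invoking the expensive reference CNN on most frames, using difference detectors and cheap specialized models instead. The sketch below illustrates only that cascade control flow; it is not NoScope's implementation, and cheap_model and expensive_model are stand-ins:

    import numpy as np

    def frame_difference(frame, prev_frame):
        # Mean absolute pixel difference, standing in for NoScope's difference detectors.
        diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
        return float(np.mean(diff)) / 255.0

    def cascade_predict(frame, prev_frame, prev_label, cheap_model, expensive_model,
                        diff_threshold=0.02, conf_threshold=0.9):
        # 1. Nearly identical frame: reuse the previous answer, no model invoked.
        if frame_difference(frame, prev_frame) < diff_threshold:
            return prev_label
        # 2. A small specialized model answers whenever it is confident enough.
        label, confidence = cheap_model(frame)
        if confidence >= conf_threshold:
            return label
        # 3. Only the remaining hard frames pay for the full reference CNN.
        return expensive_model(frame)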

Holden Karau is a transgender Canadian Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo.

Presentations

Extending Spark ML: Adding your own tools and algorithms Session

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau and Seth Hendrickson introduce Spark’s ML pipelines and explain how to extend them with your own custom algorithms. Even if you don't have your own algorithm to add, you'll leave with a deeper understanding of Spark's ML pipelines.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open sourced techniques for anomaly detection and breakout detection. Prior research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Anomaly detection on live data Session

Services such as YouTube, Netflix, and Spotify popularized streaming in different industry segments, but these services do not center around live data—best exemplified by sensor data—which will be increasingly important in the future. Arun Kejariwal demonstrates how to leverage Satori to collect, discover, and react to live data feeds at ultralow latencies.

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and insights on how to address them.

Elsie Kenyon is a senior product manager at AI platform company Nara Logics, where she works with enterprise customers to define product needs and with engineers to build implementations that address them, with a focus on data processing and machine learning. Previously, Elsie was a researcher and casewriter at Harvard Business School. She holds a BA from Yale University.

Presentations

Learning from customers, keeping humans in the loop Session

Enterprises today pursue AI applications to replace logic-based expert systems in order to learn from customer and operational signals. But training data is often limited or nonexistent, and applying or extrapolating the wrong dataset can be costly to a company's business and reputation. Elsie Kenyon explains how to harness institutional human knowledge to augment data in deployed AI solutions.

Sander Kieft is the ICT architect at Sanoma Media, where he is responsible for the common services and performance-based titles within Sanoma. His team designs and builds (web) services for some of the largest websites and most popular mobile applications in the Netherlands, Belgium, and Finland. Sander has been working with large-scale data in media for 15 years and with Hadoop and big data platforms in production for nearly a decade. Previously, he was a developer, architect, and technology manager for some of the largest websites in the Netherlands.

Presentations

The pitfalls of running a self-service big data platform Session

Sanoma has been running big data as a self-service platform for over five years, mainly as a service for business analysts to work directly on the source data. The road to getting business analysts to directly do their analyses on Hadoop was far from smooth. Sander Kieft explores Sanoma's journey and shares some lessons learned along the way.

Kimoon Kim is a software engineer at Pepperdata. Kimoon has hands-on experience with large distributed systems processing massive datasets. Previously, he worked for the Google Search and Yahoo Search teams for many years.

Presentations

HDFS on Kubernetes: Lessons learned Session

There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.

James Kirkland is the advocate for Red Hat’s initiatives and solutions for the Internet of Things (IoT) and is the architect of Red Hat’s strategy for IoT deployments. This open source architecture combines data acquisition, integration, and rules activation with command and control data flows among devices, gateways, and the cloud to connect customers’ operational technology environments with information technology infrastructure and provide agile IoT integration. James serves as the head subject-matter expert and global team leader of system architects responsible for accelerating IoT implementations for customers worldwide. Through his collaboration with customers, partners, and systems integrators, Red Hat has grown its IoT ecosystem, expanding its presence in industries including transportation, logistics, and retail and accelerating adoption of IoT in large enterprises. His extensive knowledge of Unix and Linux variants spans his 20-year career at Red Hat, Racemi, and Hewlett-Packard. James is a steering committee member of the IoT working group for Eclipse.org, a member of the IIC, and a frequent public speaker and author on a wide range of technical topics.

Presentations

An open source architecture for the IoT Session

Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. Dave Shuman and James Kirkland showcase an end-to-end architecture for the IoT based on open source standards, highlighting Eclipse Kura, an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.

Olivia Klose is a software development engineer in the Technical Evangelism and Development group at Microsoft, where she focuses on all analytics services on Microsoft Azure, in particular Hadoop (HDInsight), Spark, and Machine Learning. Olivia is a frequent speaker at conferences in Germany and around the world, including TechEd Europe, PASS Summit, and Technical Summit. She studied computer science and mathematics at the University of Cambridge, the Technical University of Munich, and IIT Bombay, with a focus on machine learning in medical imaging.

Presentations

Deploying deep learning to assist the digital pathologist Session

Jon Fuller and Olivia Klose explain how KNIME, Apache Spark, and Microsoft Azure enable fast and cheap automated classification of malignant lymphoma type in digital pathology images. The trained model is deployed to end users as a web application using the KNIME WebPortal.

Keith Kohl is vice president of product management at Syncsort, where he is responsible for product management strategy, roadmap, and feature definition across Syncsort’s product portfolio. Keith has more than 16 years of data management market experience. Previously, Keith served as vice president of product management at Trillium Software, where he focused on Trillium’s global product strategy for enterprise data quality solutions, encompassing both established and emerging big data solutions, as deployed on-premises and in the cloud.

Presentations

A governance checklist for making your big data into trusted data (sponsored by Syncsort) Session

If users get conflicting analytics results, wild predictions, and crazy reports from the data in your data lake, they will lose trust. From the beginning of your data lake project, you need to build in solid business rules, data quality checking, and enhancement. Keith Kohl shares an actionable checklist that shows everyone in your enterprise that your big data can be trusted.

Priya Koul is the vice president of engineering for digital partnerships and closed-loop capabilities at American Express, where she leads key initiatives that transform its enterprise network information assets into innovative digital products and create unique value for customers in mobile and web applications and on partner platforms. Priya also leads the company’s end-to-end technology strategy, capability development, and technology platforms to launch innovative digital products and partnerships while advancing core platforms powering Amex’s closed loop. Her team partners closely with several business and technology teams in driving the launch of key digital products all the way from ideation, product definition, and design to deployment and ongoing management across all American Express markets. Priya has led the launch of several unique, groundbreaking, and industry-first digital products enabling strategic partnerships with Foursquare, Facebook, Twitter, Xbox, Apple, Samsung, and TripAdvisor. She also led the development and launch of key AXP network platforms such as Card SYNC, Smart Offers, and Tweet to Buy, leading the journey from payments to commerce. Her team launched the ability for American Express card members to pay with points in NYC taxi cabs, from within the Uber app, on Best Buy online, at Rite Aid, McDonald’s, and Chili’s restaurants, and on Airbnb. She also leads the global American Express SafeKey payer authentication capability, which adds an extra layer of security for online shoppers, and drives the advancement of strategic on-network global platforms that power Amex’s Digital Offer ecosystem and campaigns like Small Business Saturday and Shop Small globally. Priya led the digital payment platform Amex Express Checkout across multiple online merchants, and her team was also responsible for building an API payment platform that facilitates B2B payments for large and middle-market payment partners. In addition to advancing and delivering tech platforms, she also leads the community of technical practice for artificial intelligence across American Express. Priya’s core strengths include partnership building across internal and external partners. She places the highest importance on developing her team, fostering an innovative mindset, and collaboration.

Presentations

AI at scale at American Express: Walking the talk Findata

The AI landscape is rapidly evolving, offering a lot of promise. . .and a lot of hype. Sastry Durvasula and Priya Koul explain how American Express is building an AI ecosystem at scale to unlock differentiated customer experiences and open up new business opportunities.

Chi-Yi Kuan is director of business analytics at LinkedIn. He has over 15 years of extensive experience in applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Presentations

The EOI framework for big data analytics to drive business impact at scale Session

Michael Li and Chi-Yi Kuan offer an overview of the EOI framework for big data analytics and explain how to leverage this framework to drive and grow business in key corporate functions, such as product, marketing, and sales.

Sanjeev Kulkarni is the cofounder of Streamlio, a company focused on building a next-generation real-time stack. Previously, he was the technical lead for real-time analytics at Twitter, where he cocreated Twitter Heron; worked at Locomatix handling the company’s engineering stack; and led several initiatives for the AdSense team at Google. Sanjeev holds an MS in computer science from the University of Wisconsin, Madison.

Presentations

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and insights on how to address them.

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. Jared is the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, as well as an adjunct professor of statistics at Columbia University, in addition to his client-facing consulting and training. Jared specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world. He was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master’s degree in statistics from Columbia University and a bachelor’s degree in mathematics from Muhlenberg College.

Presentations

Machine learning in R Tutorial

Modern statistics has become almost synonymous with machine learning—a collection of techniques that utilize today's incredible computing power. Jared Lander walks you through the available methods for implementing machine learning algorithms in R and explores underlying theories such as the elastic net, boosted trees, and cross-validation.
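
For reference, the elastic net mentioned here blends the lasso and ridge penalties. In the usual glmnet-style formulation, the fitted coefficients solve:

    \hat{\beta} = \operatorname*{arg\,min}_{\beta}\;
      \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - x_i^{\top} \beta \bigr)^2
      + \lambda \Bigl( \alpha \lVert \beta \rVert_1
      + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Bigr)

Setting alpha = 1 recovers the lasso and alpha = 0 the ridge penalty; lambda is typically chosen by the cross-validation the tutorial also covers.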

Scott is a partner and research scientist at Uncharted with over twelve years of industry and academic experience. He holds a PhD in computer science with a background in machine learning. Scott’s research interests are large-scale visual analytics and adaptive user interfaces.

Presentations

Text analytics and new visualization techniques Session

Text analytics is advancing rapidly, and new visualization techniques for text are providing new capabilities. Richard Brath offers an overview of these new ways to organize massive volumes of text, characterize subjects, score synopses, and skim through large collections of documents.

Sam Lavigne is an editor at the New Inquiry and an instructor at NYU and the New School. An artist, programmer, and teacher, Sam has exhibited his work—which deals with data, cops, surveillance, natural language processing, and automation—at Rhizome, Flux Factory, Lincoln Center, SFMOMA, Pioneer Works, DIS, and the Smithsonian, among others.

Presentations

White Collar Crime Risk Zones Keynote

Sam Lavigne offers an overview of White Collar Crime Risk Zones, a predictive policing application that uses industry-standard predictive policing methodologies to predict financial crime at the city-block level with an accuracy of 90.12%. Unlike typical predictive policing apps, which criminalize poverty, White Collar Crime Risk Zones criminalizes wealth.

Reuven Lax is a senior staff software engineer at Google, the tech lead for cloud-based stream processing (i.e., the streaming engine behind Google Cloud Dataflow), and the former tech lead of MillWheel.

Presentations

Realizing the promise of portability with Apache Beam Session

Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. Reuven Lax offers an overview of Beam basic concepts and demonstrates the portability in action.
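
To make the portability claim concrete: the same pipeline definition runs on the direct runner, Cloud Dataflow, Flink, or Spark by changing only the runner option. A minimal word count in the Beam Python SDK (run here on the default local runner):

    import apache_beam as beam

    # Runs on the local direct runner by default; pass a different runner via
    # PipelineOptions to execute the identical code on Dataflow, Flink, or Spark.
    with beam.Pipeline() as p:
        (p
         | "Create" >> beam.Create(["apache beam", "portable beam pipelines"])
         | "Split" >> beam.FlatMap(str.split)
         | "Pair" >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))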

Francesca Lazzeri is a data scientist at Microsoft, where she is part of the Algorithms and Data Science team. Francesca is passionate about innovations in big data technologies and applications of advanced analytics to real-world problems. Her work focuses on the deployment of machine learning algorithms and web service-based solutions to solve real business problems for customers in the energy, retail, and HR analytics sectors. Previously, she was a postdoctoral research fellow in business economics at Harvard Business School. She holds a PhD in economics and management.

Presentations

Putting data to work: How to optimize workforce staffing to improve organization profitability Session

New machine learning technologies allow companies to apply better staffing strategies by taking advantage of historical data. Francesca Lazzeri and Hong Lu share a workforce placement recommendation solution that recommends staff with the best professional profile for new projects.

Julien Le Dem is the cocreator of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig. Julien was previously an architect at Dremio and the tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

The columnar roadmap: Apache Parquet and Apache Arrow Session

Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.
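
The reader improvements are already visible from Python via pyarrow, where a columnar read prunes I/O to the requested columns and conversion to pandas avoids copies where the types allow. A small sketch; the filename and column names are hypothetical:

    import pyarrow.parquet as pq

    # Read only two columns; Parquet metadata lets the reader skip the rest.
    table = pq.read_table("events.parquet", columns=["user_id", "ts"])
    print(table.schema)

    # Arrow-to-pandas conversion is zero-copy where the column types allow it.
    df = table.to_pandas()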

Toni LeTempt is a senior technical expert at Walmart. Toni has 18 years’ IT experience, five of them working with large secure enterprise Hadoop clusters.

Presentations

An authenticated journey through big data security at Walmart Session

In today’s world of data breaches and hackers, security is one of the most important components for big data systems, but unfortunately, it's usually the area least planned and architected. Matt Bolte and Toni LeTempt share Walmart's authentication journey, focusing on how decisions made early can have significant impact throughout the maturation of your big data environment.

Evan Levy is the vice president of data management programs at SAS, where he advises clients on strategies for addressing business challenges using data, technology, and creative approaches that align IT with business capabilities, offering practical advice on addressing these challenges in a manner that leverages a company’s existing skills coupled with new methods to ensure IT and business success. A speaker, writer, and consultant in the areas of enterprise data strategy and data management, Evan is also a faculty member of TDWI as well as a best practices judge in the areas of business intelligence, data integration, and data management. Evan is the coauthor of the first book on MDM, Customer Data Integration: Reaching a Single Version of the Truth, which describes the business breakthroughs achieved with integrated customer data and explains how to make master data management successful.

Presentations

The five components of a data strategy Session

While few would dispute the need for organizations to have a comprehensive data strategy, few have actually developed a strategy and plan to improve the access, sharing, and usage of data. Evan Levy discusses the five essential components that make up a data strategy and explores the individual attributes of each.

Lisha is an investor at Amplify Partners. She focuses on companies that leverage machine learning and data to solve problems, and she is excited to be investing at a time when algorithmic and data-driven methods have such incredible potential for impact.

Lisha completed her PhD at UC Berkeley, focusing on deep learning and probability applied to the problem of clustering in graphs. She worked with Prof. David Aldous and Prof. Joan Bruna and was supported by the prestigious NSERC CGS fellowship. While at Berkeley, she also did statistical consulting, advising on methods and analysis for experimentation and interpretation, and interned as a data scientist at Pinterest and Stitch Fix. She was the lecturer for discrete mathematics as well as the graduate instructor for probability and statistics and intro CS theory. Prior to that, she earned her master of science in mathematics at the University of Toronto with highest distinction, advised by Prof. Balazs Szegedy in the area of graph limits.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Tianhui Michael Li is the founder and CEO of the Data Incubator. Michael has worked as a data scientist lead at Foursquare, a quant at D.E. Shaw and JPMorgan, and a rocket scientist at NASA. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup that lets him focus on what he really loves. He did his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall scholar.

Presentations

Machine learning with TensorFlow 2-Day Training

Dana Mastropole and Michael Li demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.

Machine learning with TensorFlow (Day 2) Training Day 2

Dana Mastropole and Michael Li demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.
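
As a preview of the API style the course teaches, here is a small TFLearn network trained on MNIST; the layer sizes and hyperparameters are illustrative choices, not the course's actual materials:

    import tflearn
    import tflearn.datasets.mnist as mnist

    # Flattened 28x28 images with one-hot labels.
    X, Y, testX, testY = mnist.load_data(one_hot=True)

    net = tflearn.input_data(shape=[None, 784])
    net = tflearn.fully_connected(net, 64, activation="relu")
    net = tflearn.fully_connected(net, 10, activation="softmax")
    net = tflearn.regression(net, optimizer="adam", loss="categorical_crossentropy")

    model = tflearn.DNN(net)
    model.fit(X, Y, validation_set=(testX, testY), n_epoch=5, batch_size=128)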

Michael Li is head of analytics at LinkedIn, where he helps define what big data means for LinkedIn’s business and how it can drive business value through the EOI (Enable/Optimize/Innovate) analytics framework. Michael is passionate about solving complicated business problems with a combination of superb analytical skills and sharp business instincts. His specialties include building and leading high-performance teams quickly to meet the needs for fast-paced, growing companies. Michael has a number of years’ experience in big data innovation, business analytics, business intelligence, predictive analytics, fraud detection, analytics, operations, and statistical modeling across financial, ecommerce, and social networks.

Presentations

The EOI framework for big data analytics to drive business impact at scale Session

Michael Li and Chi-Yi Kuan offer an overview of the EOI framework for big data analytics and explain how to leverage this framework to drive and grow business in key corporate functions, such as product, marketing, and sales.

Zhichao Li is a senior software engineer at Intel focused on distributed machine learning, especially large-scale analytical applications and infrastructure on Spark. He’s also an active contributor to Spark. Previously, Zhichao worked in Morgan Stanley’s FX department.

Presentations

Building advanced analytics and deep learning on Apache Spark with BigDL Session

Yuhao Yang and Zhichao Li discuss building end-to-end analytics and deep learning applications, such as speech recognition and object detection, on top of BigDL and Spark and explore recent developments in BigDL, including Python APIs, notebook and TensorBoard support, TensorFlow model R/W support, better recurrent and recursive net support, and 3D image convolutions.
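
For a flavor of the Python API mentioned above, the sketch below follows BigDL's documented Python training flow (a Sequential model, an RDD of Sample records, and the Optimizer class); the network and data are toy placeholders, and exact names may vary across BigDL versions:

    import numpy as np
    from pyspark import SparkContext
    from bigdl.util.common import create_spark_conf, init_engine, Sample
    from bigdl.nn.layer import Sequential, Linear, ReLU
    from bigdl.nn.criterion import MSECriterion
    from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

    sc = SparkContext(conf=create_spark_conf().setMaster("local[4]"))
    init_engine()

    # Toy regression network and random training data as an RDD of Samples.
    model = Sequential().add(Linear(2, 8)).add(ReLU()).add(Linear(8, 1))
    train = sc.parallelize(
        [Sample.from_ndarray(np.random.rand(2), np.random.rand(1)) for _ in range(256)])

    optimizer = Optimizer(model=model, training_rdd=train, criterion=MSECriterion(),
                          optim_method=SGD(learningrate=0.01), end_trigger=MaxEpoch(2),
                          batch_size=32)
    trained = optimizer.optimize()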

Julia Lintern is a senior data scientist at Metis, where she coteaches the data science bootcamp, develops curricula, and focuses on various other special projects. Previously, Julia worked as a data scientist at JetBlue, where she used quantitative analysis and machine learning methods to provide continuous assessment of the aircraft fleet. Julia began her career as a structures engineer designing repairs for damaged aircraft. In her free time, she collaborates on various projects, such as the development of a trap music generator; she has also worked on creative side projects such as Lia Lintern, her own fashion label. Julia holds an MA in applied math from Hunter College, where she focused on visualizations of various numerical methods including collocation and finite element methods and discovered a deep appreciation for the combination of mathematics and visualizations, leading her to data science as a natural extension of these ideas.

Presentations

A deep dive into deep learning with Keras Tutorial

Julia Lintern offers a deep dive into deep learning with Keras, beginning with basic neural nets and winding through to convolutional neural nets and recurrent neural nets. Along the way, Julia explains both the design theory and the Keras implementations of today's most widely used deep learning algorithms.
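
As a taste of the starting point, a basic dense network in Keras fits in a few lines; the data, shapes, and hyperparameters below are placeholders rather than the tutorial's actual materials:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Toy binary classification data: 100 samples, 20 features.
    X = np.random.rand(100, 20)
    y = np.random.randint(0, 2, size=(100, 1))

    model = Sequential([
        Dense(32, activation="relu", input_shape=(20,)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=16)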

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

A brave new world in mutable big data: Relational storage Session

To date, mutable big data storage has primarily been the domain of nonrelational (NoSQL) systems such as Apache HBase. However, demand for real-time analytic architectures has led big data back to a familiar friend: relationally structured data storage systems. Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu.

Ryan Lippert is a senior product marketing manager at Cloudera, where he is responsible for the company’s Operational Database offering and for marketing its storage products. Previously, Ryan served in a variety of roles at Cisco Systems. He holds an economics degree from the University of Guelph and an MBA from Stanford.

Presentations

The sunset of lambda: New architectures amplify IoT impact Session

A long time ago in a data center far, far away, we deployed complex lambda architectures as the backbone of our IoT solutions. Though hard, they enabled collection of real-time sensor data and slightly delayed analytics. Michael Crutcher and Ryan Lippert explain why Apache Kudu, a relational storage layer for fast analytics on fast data, is the key to unlocking the value in IoT data.

Julie Lockner is cofounder of 17 Minds Corporation, a startup focusing on improving care and education plans for children with special needs. She has held executive roles at InterSystems, Informatica, and EMC and was an analyst at ESG. She was founder and CEO of CentricInfo, a data management consulting firm. Julie holds an MBA from MIT and a BSEE from WPI.

Presentations

Predicting tantrums with wearable data and real-time analytics Session

How can we empower individuals with special needs to reach their full potential? Julie Lockner offers an overview of a project to develop collaboration applications that use wearable device data to improve the ability to develop the best possible care and education plans. Join in to learn how real-time IoT data analytics are making this possible.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Hardcore Data Science welcome HDS

Hosts Ben Lorica and Assaf Araki welcome you to Hardcore Data Science day.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Piet Loubser is vice president of product and solutions marketing at Hortonworks, where he is responsible for the holistic positioning of the Hortonworks product and solution portfolio. Piet has more than 25 years of experience in the IT industry driving strategic marketing, product marketing, sales, and software development; he has worked with organizations across the globe and has developed deep experience in using data to drive strategic transformations. Previously, he headed up platform product marketing at Informatica; held executive marketing roles at HP and SAP; and led portfolio market strategies at Business Objects (acquired by SAP), where he also held numerous positions in regional sales management and business development in offices within the United States, Europe, and South Africa.

Presentations

Powering business outcomes with data science in a connected world (sponsored by Hortonworks) Session

Data has become the new fuel of business success. As a result, business intelligence and analytics are among the top priorities for CIOs today. Piet Loubser outlines the tectonic shift currently taking place in the market and explains why next-gen connected architectures are crucial to meet the demands of an intelligent, connected world.

Hong Lu is a data scientist at Microsoft. Hong is passionate about innovations in big data technologies and the application of advanced analytics to real-world problems. During her time at Microsoft, Hong has built end-to-end data science solutions for customers in the energy, retail, and education sectors. Previously, she worked on optimizing advertising platforms in the video advertising industry. Hong holds a PhD in biomedical engineering from Case Western Reserve University, where her research focused on machine learning-based medical image analysis.

Presentations

Putting data to work: How to optimize workforce staffing to improve organization profitability Session

New machine learning technologies allow companies to apply better staffing strategies by taking advantage of historical data. Francesca Lazzeri and Hong Lu share a workforce placement recommendation solution that recommends staff with the best professional profile for new projects.

Zhenxiao Luo is a software engineer at Uber working on Presto and Parquet. Previously, he led the development and operations of Presto at Netflix and worked on big data and Hadoop-related projects at Facebook, Cloudera, and Vertica. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.

Presentations

Executive panel: Big data and the cloud Down Under Session

Major companies in Australia and New Zealand, including Air New Zealand, Westpac, and ANZ, have been pioneering the adoption of big data technologies like Hadoop. In a panel moderated by Steve Totman, senior execs from these companies share use cases, challenges, and how to be successful Down Under, on the opposite side of the world from where technologies like Hadoop got started.

Geospatial big data analysis at Uber Session

Uber's geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao Luo and Wei Yan explain how Uber runs geospatial analysis efficiently in its big data systems, including Hadoop, Hive, and Presto.

Thiruvalluvan M G is vice president of engineering at Aqfer. Previously, Thiru was a distinguished architect at Yahoo; principal hacker at Altiscale; and an architect at Stata Labs, where he built desktop search engine Bloomba. He also held a number of technical and managerial engineering roles at Accel and Hewlett-Packard. He is a committer and PMC member of the Apache Avro project. Thiru holds a BE in electronics and communications engineering from Anna University.

Presentations

SETL: An efficient and predictable way to do Spark ETL Session

Common ETL jobs used for importing log data into Hadoop clusters require a considerable amount of resources, which varies with input size. Thiruvalluvan M G shares a set of techniques—involving an innovative use of Spark processing and exploiting features of Hadoop file formats—that not only make these jobs much more efficient but also let them work well with a fixed amount of resources.

Allan MacInnis is a solutions architect at Amazon Web Services, where he works on streaming data and analytics and helps AWS customers build solutions that enable them to gain immediate insight into their business and operations. Allan has held a number of roles at Amazon, including software development manager, where he helped to build innovative new products such as the Amazon Kindle and Amazon Flex. Previously, he spent several years as a software developer and architect at Dell. Allan holds a degree in electrical engineering from Dalhousie University.

Presentations

Creating a serverless real-time analytics platform powered by machine learning in the cloud Session

Speed matters. Today, decisions are made based on real-time insights, but in order to support the substantial growth of streaming data, companies are required to innovate. Roy Ben-Alta and Allan MacInnis explore AWS solutions powered by machine learning and artificial intelligence.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Ted Malaska, Gwen Shapira, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams that can effectively leverage them to achieve ROI with your data initiatives.

Bruce Martin is a senior instructor at Cloudera, where he teaches courses on data science, Apache Spark, Apache Hadoop, and data analysis. Previously, Bruce was principal architect and director of advanced concepts at SunGard Higher Education, where he developed the software architecture for SunGard's Course Signals Early Intervention System, which uses machine learning algorithms to predict the success of students enrolled in university courses. Bruce's other roles have included senior staff engineer at Sun Microsystems and researcher at Hewlett-Packard Laboratories. Bruce has written many papers on data management and distributed system technologies, frequently presents his work at academic and industrial conferences, and has authored patents on distributed object technologies. He holds a PhD and a master's degree in computer science from the University of California, San Diego, and a bachelor's degree in computer science from the University of California, Berkeley.

Presentations

Cloudera big data architecture workshop 2-Day Training

This training brings together technical contributors in a group setting to design and architect solutions to a challenging business problem. You'll explore big data application architecture concepts in general and then apply them to the design of a challenging system.

Cloudera Big Data Architecture Workshop (Day 2) Training Day 2

The Cloudera Big Data Architecture Workshop (BDAW) is a two-day learning event that addresses advanced big data architecture topics. BDAW brings together technical contributors in a group setting to design and architect solutions to a challenging business problem. The workshop addresses big data architecture problems in general and then applies these concepts to the design of a challenging system.

Hilary Mason is founder and CEO of Fast Forward Labs, a machine intelligence research company, and data scientist in residence at Accel Partners. Previously, Hilary was chief scientist at Bitly. She cohosts DataGotham, a conference for New York’s homegrown data community, and cofounded HackNY, a nonprofit that helps engineering students find opportunities in New York’s creative technical economy. Hilary served on Mayor Bloomberg’s Technology Advisory Board and is a member of Brooklyn hacker collective NYC Resistor.

Presentations

Executive Briefing: Talking to machines—Natural language today Session

Progress in machine learning has led us to believe we might soon be able to build machines that talk to us using the same interfaces that we use to talk to each other: natural language. But how close are we? Hilary Mason explores the current state of natural language technologies and some applications where this technology is thriving today and imagines what we might build in the next few years.

Dana Mastropole is a data scientist in residence at the Data Incubator and contributes to curriculum development and instruction. Previously, Dana taught elementary school science after completing MIT’s Kaufman teaching certificate program. She studied physics as an undergraduate student at Georgetown University and holds a master’s in physical oceanography from MIT.

Presentations

Machine learning with TensorFlow 2-Day Training

Dana Mastropole and Michael Li demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.

Machine learning with TensorFlow (Day 2) Training Day 2

Dana Mastropole and Michael Li demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.
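To give a flavor of the TFLearn interface the training covers, here is a minimal sketch, assuming a recent TFLearn release and made-up toy data (none of this comes from the course materials):

    import numpy as np
    import tflearn

    # Hypothetical toy data: 100 samples, 10 features, 2 one-hot classes.
    X = np.random.rand(100, 10).astype(np.float32)
    Y = tflearn.data_utils.to_categorical(np.random.randint(0, 2, 100), 2)

    # A small feed-forward classifier, built layer by layer.
    net = tflearn.input_data(shape=[None, 10])
    net = tflearn.fully_connected(net, 32, activation='relu')
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net)  # defaults: Adam optimizer, categorical crossentropy

    model = tflearn.DNN(net)
    model.fit(X, Y, n_epoch=10, batch_size=16, show_metric=True)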

Murthy Mathiprakasam is a director of product marketing for Informatica’s big data products, where he is responsible for outbound marketing activities. Murthy has a decade and a half of experience working with emerging high-growth software technologies, including roles at Mercury Interactive/HP, Google, eBay, VMware, and Oracle. Murthy holds an MS in management science from Stanford University and BS degrees in management science and computer science from the Massachusetts Institute of Technology.

Presentations

Using an AI-driven approach to managing data lakes in the cloud or on-premises (sponsored by Informatica) Session

In the face of regulatory and competitive pressures, why not use artificial intelligence, along with smart best practices, to manage data lakes? Murthy Mathiprakasam shares a comprehensive approach to data lake management that ensures that you can quickly and flexibly ingest, cleanse, master, govern, secure, and deliver all types of data in the cloud or on-premises.

Andy Mauro is the cofounder and CEO of Automat, a conversational marketing platform that uses AI to allow companies to have personalized one-on-one messaging conversations with their customers to better understand and serve them. The cofounders and team collectively have over 50 years of experience and 17 patents in the fields of speech recognition, natural language understanding, virtual assistants, and AI. Automat has received investments from You & Mr Jones, Comcast Ventures, Relay and Real Ventures, the Slackbot fund, USAA, and Omidyar Technology Ventures. Its advisory board consists of technology trend spotter Tim O’Reilly; Gary Clayton, former chief creative officer of TellMe and Nuance; Salesforce chief scientist Richard Socher; conversational commerce pioneer Chris Messina; and Adobe board member Amy Banse.

Presentations

Executive Briefing: Conversational marketing for brands—Why it's better to talk to your customers than monitor them Session

Andy Mauro explains why the last 15 years of digital marketing have really been about monitoring customers and how recent advancements in artificial intelligence and the dominance of messaging as the primary consumer channel provide an opportunity to achieve every marketer's dream: simply talking to their customers.

Tony McAllister is the director of enterprise architecture at Be the Match, part of the National Marrow Donor Program, where he and his team design and build technology solutions that deliver cellular therapy solutions to patients in need of a transplant. The team is currently building a real-time, distributed computing search engine on the Hadoop platform to find the best donor match from a global registry of over 30 million donors.

Presentations

Implementing Hadoop to save lives Session

The National Marrow Donor Program (Be the Match) recently moved its core transplant matching platform onto Cloudera Hadoop. Tony McAllister explains why the program chose Cloudera Hadoop and shares its big data goals: to increase the number of donors and matches, make the process more efficient, and make transplants more effective.

Michael McCune is a software developer in Red Hat’s emerging technology group, where he develops and deploys applications for cloud platforms. He is an active contributor to several radanalytics.io projects and is a core reviewer for the OpenStack API working group. Previously, Michael developed Linux-based software for embedded global positioning systems.

Presentations

From notebooks to cloud native: A modern path for data-driven applications Session

Notebook interfaces like Apache Zeppelin and Project Jupyter are excellent starting points for sketching out ideas and exploring data-driven algorithms, but where does the process lead after the notebook work has been completed? Michael McCune offers some answers as they relate to cloud-native platforms.

Jason McIntyre is Accenture’s Digital Ecosystem Alliance management lead.

Presentations

Executive Briefing: Data ecosystem strategy Session

Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Jason McIntyre and Mark Milazzo share examples of teaming models and leading practices for accelerating value from your ecosystem strategy.

Tim McKenzie is general manager of big data solutions at Pitney Bowes, where he leads a global team dedicated to helping clients unlock the value that is hidden in the massive amounts of data collected about customers, infrastructure, and products. With over 17 years of experience engaging with customers about technology, Tim has a proven track record of delivering value in every engagement.

Presentations

Big data, location analytics, and geoenrichment to drive better business outcomes (sponsored by Pitney Bowes) Session

Organizations need to have a data strategy that includes the tools to derive location intelligence, enhance existing data with geographic enrichment (geoenrichment), and perform location analytics to reveal strategic and operational insights. Tim McKenzie shares new data quality and location intelligence approaches that operate natively within Hadoop and Spark environments.

Fiona has been described as a pioneer in the field of analytics. She has worked alongside organizations in virtually every industry, including some of the largest global organizations, helping them derive tangible benefit from the strategic application of technology to real-world business scenarios. With SAS for over 19 years, Fiona is currently focused on new, emerging analytical technology and is a known speaker, author, and innovator in the field of analytics. Fiona coauthored Heuristics in Analytics and is a member of the Cognitive Computing Consortium working group.

Presentations

Meeting the challenges of the analytics economy (sponsored by SAS) Session

Much is being written about the economy of everything. Where does the analytics economy fit in? The analytics economy offers a chance to disrupt traditional business models and requires realizing value from the application of analytics to data. Fiona explores the unique challenges and the intersections with related technologies like machine learning, deep learning, cognitive computing, and more.

Matteo Merli is a software engineer at Streamlio, where he works on messaging and storage technologies. Previously, he spent several years building database replication systems and multitenant messaging platforms at Yahoo. Matteo was the architect and lead developer for Pulsar and is a member of the PMC of Apache BookKeeper.

Presentations

Messaging, storage, or both: The real-time story of Pulsar and Apache DistributedLog Session

Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. Matteo Merli and Sijie Guo offer an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.

Mark Milazzo is responsible for developing and managing the Accenture Insights Platform Partner Program. He has been with Accenture for four years but is a 20+-year veteran of the technology market, with the last 15 years focused on developing and managing alliance and vendor management programs.

Presentations

Executive Briefing: Data ecosystem strategy Session

Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Jason McIntyre and Mark Milazzo share examples of teaming models and leading practices for accelerating value from your ecosystem strategy.

Chris Mills is big data lead at if(we). Chris has been coding since grade school. Unable to choose between science and engineering, he has spent his career working on projects incorporating both fields, in genetics, natural language processing, distance learning, content syndication, automated categorization, and recommender systems. Chris loves games and puzzles of all sorts and thinks that the intersection of big data and human behavior offers some of the very best puzzles available.

Presentations

Lessons from an AWS migration Session

if(we)'s batch event processing pipeline is different from yours, but the process of migrating it from running in a data center to running in AWS is likely pretty similar. Chris Mills explains what was easier than expected, what was harder, and what the company wished it had known before starting the migration.

Harjinder Mistry is a member of the Developer Tools team at Red Hat, where he is incorporating data science into next-generation developer tools powered by Spark. Previously, he was a member of IBM’s Analytics team, where he developed Spark ML Pipelines components for the IBM Analytics platform, and spent several years on the DB2 SQL Query Optimizer team building and fixing the mathematical model that decides the query execution plan. Harjinder holds an MTech from IIIT, Bangalore, India.

Presentations

AI-driven next-generation developer tools Session

Bargava Subramanian and Harjinder Mistry explain how machine learning and deep learning techniques are helping Red Hat build smart developer tools that make software developers more efficient.

Karen Moon is cofounder and CEO of Trendalytics, a style-centric visual data platform that measures consumer engagement with merchandise trends. Karen has more than 12 years of experience in retail and technology working with companies across the supply chain, including department stores, luxury retailers, and independent designers. Previously, she executed Goode Partners’ investment in Skullcandy and worked on the turnaround of a luxury specialty retailer; worked in Gap’s corporate strategy group, where she assessed acquisition and new retail concept opportunities such as Piperlime.com; and held positions at Goldman Sachs, where she executed over $1 billion in technology and media transactions. She has been featured in the Wall Street Journal, Forbes, and other publications. Karen holds an MBA from Harvard Business School and a BA (summa cum laude) from UCLA. Her research at Harvard included studies in multichannel retailing, luxury diffusion brands, and supply chain innovation for emerging designers. Karen initially pursued a BA in fashion design at Otis College of Art & Design.

Presentations

Retail's panacea: How machine learning is driving product development Session

In a panel discussion, Karen Moon, Jared Schiffman, and Marta Jamrozik explore how the retail industry is embracing data to include consumers in the design and development process, how it is tackling the challenges posed by the wealth of sources and the unstructured nature of the data it handles, and how that data is turned into digestible, actionable insights.

Presentations

Detecting a spoofing overlay Tutorial

Regulators increasingly require market participants to self-monitor to prevent manipulative practices such as spoofing and layering. Jason Morton shares methods for detecting a spoofing overlay on top of a legitimate strategy from a flow of time-stamped order and cancellation messages.
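As a flavor of what working with such order flows involves, here is a hypothetical toy heuristic in pandas (emphatically not the session's method): flag accounts that cancel repeatedly within a short rolling window.

    import pandas as pd

    # Hypothetical toy data: time-stamped order messages per account.
    msgs = pd.DataFrame({
        "ts": pd.to_datetime([
            "2017-09-26 09:30:00", "2017-09-26 09:30:05",
            "2017-09-26 09:30:08", "2017-09-26 09:30:20",
        ]),
        "account": ["A", "A", "A", "B"],
        "event": ["new", "cancel", "cancel", "fill"],
    }).set_index("ts")

    # Rolling 60-second count of cancels per account.
    cancels = (msgs.groupby("account")["event"]
                   .apply(lambda s: s.eq("cancel").astype(int).rolling("60s").sum()))
    print(cancels[cancels > 1])  # accounts with repeated cancels in one window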

Karthikeyan Nagalingam is a senior technical marketing engineer at NetApp, where he is responsible for defining and developing big data analytics data protection technologies, producing best-practices documentation, and helping customers implement Hadoop and NoSQL solutions. Karthik has extensive experience architecting Hadoop solutions in the cloud, in hybrid clouds, and on-premises; he has developed numerous proofs of concept, worked with customers on deploying Hadoop solutions, and spoken at many industry, customer, and partner events. He also has extensive experience deploying and developing in Linux environments and holds a patent on distributed data storage and processing techniques. Karthik holds an MS in software systems from the Birla Institute of Technology and Science and a bachelor's degree in engineering from SriRam Engineering College.

Presentations

Key big data architectural considerations for deploying in the cloud and on-premises (sponsored by NetApp) Session

When analytics applications become business critical, balancing cost with SLAs for performance, backup, dev/test, and recovery can be challenging. Karthikeyan Nagalingam discusses these big data architectural challenges and proposes solutions that allow you to create a cost-optimized architecture for rapid deployment of business-critical applications that meet corporate SLAs today and into the future.

Milind Nagnur is the managing director and head of CTO data services at Citi, where he and his team deliver strategic solutions to clients focused on revenue discovery, regulatory compliance, and business performance transformation leveraging digital and data analytics. Milind has more than 18 years of business and IT experience across data, IT strategy, architecture, infrastructure, and application development, as well as experience managing large, complex global transformation initiatives, such as Citi’s next-generation Enterprise Analytics Platform (EAP 2.0). Milind designs and builds large-scale, data-oriented, high-performing technology platforms, strategically connecting technical capabilities with business value and generating successful product solutions and intellectual property for Citi. He serves on the advisory board of Citi Ventures for the firm’s data and analytics investments and is project sponsor, mentor, and senior advocate for Citi’s Women’s Leadership Council’s Developing Talent Program (DTP) and Emerging Talent Program (ETP). Previously, Milind was the principal architect and apps development group manager in Trade and Treasury Services at JPMorgan Chase and a systems integration consultant for financial services clients at PricewaterhouseCoopers. He holds a BTech in mechanical engineering from the Indian Institute of Technology, Mumbai, and an MBA in finance and computer information systems from the Indian Institute of Management, Calcutta.

Presentations

Next-generation data management Session

Milind Nagnur discusses next-generation data platforms, from controlled exploratory sandboxes to hosting transactional applications, and explains how modern, industry-leading data management tools and self-service analytics can meet these requirements.

Paco Nathan leads the Learning group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

PyTextRank: Graph algorithms for enhanced natural language processing Session

Paco Nathan demonstrates how to use PyTextRank—an open source Python implementation of TextRank that builds atop spaCy, datasketch, NetworkX, and other popular libraries to prepare raw text for AI applications in media and learning—to move beyond outdated techniques such as stemming, n-grams, or bag-of-words while performing advanced NLP on single-server solutions.
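For readers who want to try it, usage looks roughly like the following sketch, based on a recent PyTextRank release that plugs into the spaCy pipeline (the sample sentence is made up):

    import spacy
    import pytextrank  # registers the "textrank" pipeline component with spaCy

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")  # run TextRank after spaCy's standard pipeline

    doc = nlp("Natural language processing has moved beyond bag-of-words "
              "models toward graph-based ranking of phrases and sentences.")

    # Top-ranked phrases, scored by the TextRank graph algorithm.
    for phrase in doc._.phrases[:3]:
        print(phrase.rank, phrase.text)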

Heather Nelson is a senior solution architect at Silicon Valley Data Science, where she draws from her diverse background in business and technology consulting to find the best solutions for her clients’ toughest data problems. A problem solver by nature, Heather is passionate about helping organizations leverage data to drive competitive advantage.

Presentations

Managing data science in the enterprise Tutorial

John Akred and Heather Nelson share methods and observations from three years of effectively deploying data science in enterprise organizations. You'll learn how to build, run, and get the most value from data science teams and how to work with and plan for the needs of the business.

Chris Neumann is a venture partner at 500 Startups focused on big data, machine learning, and AI. Previously, Chris was the founder and CEO of DataHero (acquired by Cloudability), which brought to market the first self-service cloud BI platform, and the first employee at Aster Data (acquired by Teradata), where he helped create the big data space.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

Alan Nichol is the cofounder and CTO of Rasa, the leading open source conversational AI company. Rasa builds tools that enable developers to create conversational software that really works and is trusted by thousands of developers in enterprises worldwide, including UBS, ERGO, and Helvetia. Rasa combines applied AI research with enterprise-ready technology. Alan holds a PhD in machine learning from the University of Cambridge and has years of experience building AI-powered products in industry.

Presentations

Deep learning for understanding language and holding conversations HDS

There's a large body of research on machine learning-based dialogue, but most voice and chat systems in production are still implemented using a state machine and a set of rules. Alan Nichol discusses Rasa's applied AI research in language understanding and dialogue and explains how open source implementations bring the state of the art to thousands of developers.

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Ryan Nienhuis, Radhika Ravirala, and Dario Rivera walk you through building a big data application using a combination of open source technologies and AWS managed services. Along the way, they also cover architecture design patterns and best practices for big data applications.

Jack Norris is the senior vice president of data and applications at MapR Technologies. Jack has a wide range of demonstrated successes, from defining new markets for small companies to increasing sales of new products for large public companies, in his 20 years spent in enterprise software marketing. Jack’s broad experience includes launching and establishing analytics, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain & Company. Jack earned an MBA from UCLA’s Anderson School of Management and a BA in economics with honors and distinction from Stanford University.

Presentations

The essentials for digital growth (sponsored by MapR) Session

Your approaches to cloud, the IoT, machine learning, and governance are interrelated and can lead to a series of successes or to obstacles and failures. Jack Norris shares lessons learned by leading companies that are leveraging data to transform customer experiences, operational results, and overall growth.

Cathy O’Neil is a data scientist for the startup media company Intent Media. Cathy began her career as a postdoc in MIT’s math department. She has been a professor at Barnard College, where she published a number of research papers in arithmetic algebraic geometry, and has worked as a quant for the hedge fund D.E. Shaw in the middle of the credit crisis and for RiskMetrics, a risk software company that assesses risk for the holdings of hedge funds and banks. Cathy holds a PhD in math from Harvard.

Presentations

Weapons of math destruction Keynote

Cathy O'Neil exposes the mathematical models that shape our future, both as individuals and as a society. These “weapons of math destruction” score teachers and students, sort résumés, grant (or deny) loans, evaluate workers, target voters, set parole, and monitor our health.

Brian O’Neill is the founder and consulting product designer at Designing for Analytics, where he focuses on helping companies design indispensable data products that customers love. Brian’s clients and past employers include Dell EMC, NetApp, TripAdvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JPMorgan Chase, the Future of Music Coalition, and E*TRADE, among others; over his career, he has worked on award-winning storage industry software for Akorri and Infinio. Brian has been designing useful, usable, and beautiful products for the web since 1996. He cofounded the adventure travel company TravelDragon.com and has invested in several Boston-area startups. When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage as a professional percussionist and drummer. He leads the acclaimed dual-ensemble, Mr. Ho’s Orchestrotica, which the Washington Post called “anything but straightforward,” and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival.

Presentations

Design for nondesigners: Increasing revenue, usability, and utility within data analytics products Session

Do you spend a lot of time explaining your data analytics product to your customers? Is your UI/UX or navigation overly complex? Are sales suffering due to complexity, or worse, are customers not using your product? Your design may be the problem. Brian O'Neill shares a secret: you don't have to be a trained designer to recognize design and UX problems and start correcting them today.

Tim O’Reilly has a history of convening conversations that reshape the computer industry. In 1998, he organized the meeting where the term “open source software” was agreed on and helped the business world understand its importance. In 2004, with the Web 2.0 Summit, he defined how “Web 2.0” represented not only the resurgence of the web after the dot-com bust but a new model for the computer industry, based on big data, collective intelligence, and the internet as a platform. In 2009, with his “Gov 2.0 Summit,” Tim framed the conversation about the modernization of government technology that has shaped policy and spawned initiatives at the federal, state, and local levels and around the world. He has now turned his attention to implications of the on-demand economy, AI, robotics, and other technologies that are transforming the nature of work and the future shape of the economy. Tim is the founder and CEO of O’Reilly Media and a partner at O’Reilly AlphaTech Ventures (OATV). He sits on the boards of Maker Media (which was spun out from O’Reilly Media in 2012), Code for America, PeerJ, Civis Analytics, and POPVOX.

Presentations

WTF? What's the future and why it's up to us Keynote

Robots are going to take our jobs, they say. Tim O'Reilly says, "Only if that's what we ask them to do!" Tim has had his fill of technological determinism. He explains why technology is the solution to human problems, and we won't run out of work till we run out of problems.

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and applications solutions. A veteran of @WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Rick Okin is vice president of data engineering for JW Player, the world’s largest network-independent video platform, where he is responsible for building innovative data products to expand JW Player’s extensive footprint. A data-driven technology expert with more than 30 years of experience in the information technology industry, Rick previously served as CTO for actionable advertising intelligence provider Integral Ad Science, where he was responsible for managing all aspects of technology and system operations.

Presentations

How JW Player is powering the online video revolution with data analytics (sponsored by Snowflake Computing) Session

Rick Okin explains how JW Player strategically leverages video data analytics to power industry- and customer-level insights for the evolving online video space.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Machine learning—Why you need it, why it's hard, what to do about it Session

Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production—the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands—and outlines proven ways to meet them.

Francois Orsini is the chief technology officer for MZ’s Satori business unit. Believing in CEO Gabriel Leydon’s vision, Francois joined MZ in January 2010 as vice president of platform engineering and chief architect, bringing his expertise in building server-side architecture and implementing next-gen social and server platforms. Previously, he served as a database architect and evangelist at Sun Microsystems from 2005 to 2010, and he has more than 25 years’ experience in OLTP database systems, middleware, and real-time infrastructure development at companies like Oracle, Sybase, and Cloudscape, honing his expertise in distributed data management systems, scalability, security, resource management, HA cluster solutions, soft real-time systems, and connectivity services. He also collaborated with Visa International and Visa USA to implement the first Visa Cash virtual ATM for the internet and founded a VC-backed startup called Unikala in 1999. Francois holds a bachelor’s degree in civil engineering and computer sciences from the Paris Institute of Technology.

Presentations

Anomaly detection on live data Session

Services such as YouTube, Netflix, and Spotify popularized streaming in different industry segments, but these services do not center around live data—best exemplified by sensor data—which will be increasingly important in the future. Arun Kejariwal demonstrates how to leverage Satori to collect, discover, and react to live data feeds at ultralow latencies.

Joel Östlund is a senior data engineer in research and development at Spotify. Previously, Joel was a data and backend engineer at a national security company, a researcher at Ericsson, and a data engineering consultant in Gurgaon, India, and in Italy. He holds an MS in industrial engineering and management with a specialization in computer science from Linköping University, Sweden, and National Chiao Tung University, Taiwan.

Presentations

Managing core data entities for internal customers at Spotify Session

Spotify makes data-driven product decisions. As the company grows, the magnitude and complexity of the data it cares for most are rapidly increasing. Sneha Rao and Joel Östlund walk you through how Spotify stores and exposes audience data created by multiple internal producers within Spotify.

Andrew Otto is a systems engineer at the Wikimedia Foundation, where he supports the Analytics team by architecting and maintaining small and big data analytics infrastructure. Previously, Andrew was the lead systems administrator at CouchSurfing.org. He is based in Brooklyn, NY, and spends too much time playing hardcourt bike polo.

Presentations

Analytics at Wikipedia Session

The Wikimedia Foundation (WMF) is a nonprofit charitable organization. As the parent company of Wikipedia, one of the most visited websites in the world, WMF faces many unique challenges around its ecosystem of editors, readers, and content. Andrew Otto and Fangjin Yang explain how the WMF does analytics and offer an overview of the technology it uses to do so.

Jon is part of the Control-M Innovation IT team in North America, where he helps integrate Control-M into the DevOps, big data, and cloud markets and supports customers working with Control-M in these areas.

Jon has been working with Control-M since 2010, administering, scheduling, and operating various environments. His specialties include Control-M, ETL processes, system administration (Linux and Windows), and programming languages such as C++, Java, Python, and Bash.

Presentations

Automated data pipelines in hybrid environments: Myth or reality? (sponsored by BMC) Session

Are you building, running, or managing complex data pipelines across hybrid environments spanning multiple applications and data sources? Doing this successfully requires automating dataflows across the entire pipeline, ideally controlled through a single source. Basil Faruqui walks you through a customer journey to automate data pipelines across a hybrid environment.

Shoumik Palkar is a second-year PhD student in the Infolab at Stanford University, working with Matei Zaharia on high-performance data analytics. He holds a degree in electrical engineering and computer science from UC Berkeley.

Presentations

Weld: Accelerating data science by 100x Session

Modern data applications combine functions from many optimized libraries (e.g., pandas and TensorFlow) and yet do not achieve peak hardware performance due to data movement across functions. Shoumik Palkar and Matei Zaharia offer an overview of Weld, a new interface to implement functions in these libraries while enabling optimizations across them.

Lloyd Palum is the CTO of Vnomics, where he directs the company’s technology development associated with optimizing fuel economy in commercial trucking. Lloyd has more than 25 years of experience in both commercial and government electronics, has published a number of technical articles and speaks frequently at industry conferences. He holds five patents in the field of software and wireless communications. Lloyd earned his MSEE from Boston University and BSEE from the University of Rochester.

Presentations

How to build a digital twin Session

A digital twin models a real-world physical asset, using mobile data, cloud computing, and machine learning to track chosen characteristics. Lloyd Palum walks you through building a tractor trailer digital twin using Python and TensorFlow. You can then use the example model to track and optimize performance.

Gene Pang is a software engineer at Alluxio. Previously, he worked at Google. Gene recently earned his PhD from the AMPLab at UC Berkeley, working on distributed database systems, and holds an MS from Stanford University and a BS from Cornell University.

Presentations

Best practices for using Alluxio with Spark Session

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory for managing data across different storage. Many deployments use Alluxio with Spark because Alluxio helps Spark further accelerate applications. Bin Fan and Gene Pang explain how Alluxio makes Spark more effective and share production deployments of Alluxio and Spark working together.
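At the code level the integration is transparent: a Spark job simply addresses data through an alluxio:// URI. Here is a minimal PySpark sketch, assuming the Alluxio client jar is on Spark's classpath and using hypothetical host and path names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-example").getOrCreate()

    # Read through Alluxio instead of HDFS; Alluxio serves the file from
    # memory when it is already cached. Host and paths are hypothetical.
    df = spark.read.text("alluxio://alluxio-master:19998/data/events.txt")

    # Write results back through the same namespace.
    df.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/out/events")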

Kevin Parent is the CEO of Conduce, a company that helps leaders and teams see and interact with all their data instantly using a single, intuitive human interface. An innovator, Kevin’s entire career has focused on connecting the dots between advances in technology and human experiences. Previously, he cofounded Oblong Industries, where he invented new–to–the-world interfaces that allow users to interact with software using displays, gestures, wands, tablets, and smartphones, and spent 10 years engineering theme park attractions. (He was a project engineer for the Twilight Zone Tower of Terror at Walt Disney Imagineering.) Kevin is the author of six patents. He holds a degree in physics from the Massachusetts Institute of Technology, where his undergraduate thesis work was conducted in MIT’s Media Lab.

Presentations

Seeing everything so managers can act on anything: The IoT in DHL Supply Chain operations Session

DHL has created an IoT initiative for its supply chain warehouse operations. Javier Esplugas and Kevin Parent explain how DHL has gained unprecedented insight—from the most comprehensive global view across all locations to a unique data feed from a single sensor—to see, understand, and act on everything that occurs in its warehouses with immersive operational data visualization.

Robert Passarella has spent over 20 years on Wall Street in the gray zone between business and technology. Rob has always focused on leveraging technology and innovative information sources to empower novel ideas in research and the investment process. A veteran of Morgan Stanley, JPMorgan, Bear Stearns, Dow Jones, and Bloomberg, he has seen these transformational challenges firsthand, up close and personal. Currently, Rob evaluates AI and machine learning investment managers for Protege Partners. Always intrigued by the consumption and use of information for investment analysis, Rob is passionate about leveraging alternative and unstructured data for use with machine learning techniques. Rob holds an MBA from Columbia Business School.

Presentations

Findata welcome Tutorial

Alistair Croll and Rob Passarella welcome you to Findata Day.

Mo Patel is a practice director for AI and deep learning at Teradata, where he mentors and advises Teradata clients and provides guidance on ongoing deep learning projects. Mo has successfully managed and executed data science projects with clients across several industries, including cable, auto manufacturing, medical device manufacturing, technology, and car insurance. Previously, Mo was a management consultant and a software engineer. A continuous learner, Mo conducts research on applications of deep learning, reinforcement learning, and graph analytics toward solving existing and novel business problems and brings a diversity of educational and hands-on expertise connecting business and technology. He holds an MBA, a master’s degree in computer science, and a bachelor’s degree in mathematics.

Presentations

Deep learning for recommender systems Tutorial

Ron Bodkin and Mo Patel demonstrate how to apply deep learning to improve consumer recommendations by training neural nets to learn categories of interest for recommendations using embeddings. You'll also learn how to achieve wide and deep learning with WALS matrix factorization—now used in production for the Google Play store.

Bob Patterson is a certified master IT architect and chief strategist at HPE, where he focuses on enterprise data, analytics, and the internet of things. As a member of HPE’s Strategic Solutions Architecture (SSA) team, Bob works with customers, industries, account teams, partners, and HPE product and services teams to drive opportunities, activities, and initiatives around data analytics and business intelligence. Previously, Bob spent 20 years as a systems engineer, consultant, IT specialist, and certified senior IT architect at IBM, where he was responsible for the design, development, and implementation of global solutions for IBM customers. He was also a member of the IBM Global Technology Services Architecture Board responsible for designing reference architectures for server consolidation and virtualization, infrastructure interoperability, and cloud computing. Bob currently guest lectures for the School of Engineering at Robert Morris University and volunteers for several nonprofit organizations. He has two patents in cloud implementation. Bob holds a BS in mechanical engineering from Carnegie Mellon University and MS degrees in computer information systems and telecommunications from the University of Denver; he also holds professional certifications in ITIL and HPE’s ExpertOne Planning and Design of Business Critical Systems.

Presentations

A comprehensive, enterprise-grade, open Hadoop solution from Hewlett Packard Enterprise (sponsored by Hewlett Packard Enterprise) Session

Bob Patterson offers an overview of Hewlett Packard Enterprise's enterprise-grade Hadoop solution, which has everything you need to accelerate your big data journey: innovative hardware architectures for diverse workloads certified for all leading distros, infrastructure software, services from HPE and partners, and add-ons like object storage.

Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh is a cofounder of the DL4J open source deep learning project and is a coauthor of Deep Learning: A Practitioner’s Approach. Josh has over 15 years’ experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif. Josh holds a master’s degree in computer science from the University of Tennessee at Chattanooga, where he did research in mesh networks and social insect swarm algorithms.

Presentations

Real-time image classification: Using convolutional neural networks on real-time streaming data Session

Enterprises building data lakes often have to deal with very large volumes of image data that they have collected over the years. Josh Patterson and Kirit Basu explain how some of the most sophisticated big data deployments are using convolutional neural nets to automatically classify images and add rich context about the content of the image, in real time, while ingesting data at scale.

Securely building deep learning models for digital health data Tutorial

Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.

Joshua Patterson is the director of applied solutions engineering at NVIDIA. Previously, Josh worked with leading experts across the public and private sectors and academia to build a next-generation cyberdefense platform. He was also a White House Presidential Innovation Fellow. His current passions are graph analytics, machine learning, and GPU data acceleration. Josh also loves storytelling with data, and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina Moore School of Business.

Presentations

Training a deep learning risk detection platform Session

Joshua Patterson and Michael Balint explain how to bootstrap a deep learning framework to detect risk and threats in production operational systems using best-of-breed GPU-accelerated open source tools.

Nick Pentreath is a principal engineer at IBM working primarily on machine learning on Apache Spark. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a member of the Apache Spark PMC and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Deep learning for recommender systems Session

In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. Nick Pentreath explores recent advances in this area in both research and practice.

Sander Pick is CTO at Set, an on-device machine learning platform that aims to embed user intelligence into every mobile application. Previously, Sander worked at Apple and Mission Motors. A Montanan, Sander likes focus, climbing, and open spaces.

Presentations

Learning location: Real-time feature extraction for mobile analytics Session

Location-based data is full of information about our everyday lives, but GPS and WiFi signals create extremely noisy mobile location data, making it hard to extract features, especially when working with real-time data. Carson Farmer and Sander Pick explore new strategies for extracting information from location data while remaining scalable, privacy focused, and contextually aware.

Mike Pittaro is a distinguished engineer at Dell EMC, where he works on big data cluster architectures. Mike has a background in high-performance computing, data warehousing, and distributed systems and has held engineering and service positions at Alliant Computer, Kendall Square Research, Informatica, and SnapLogic.

Presentations

Considerations for hardware-accelerated machine learning platforms Session

The advances we see in machine learning would be impossible without hardware improvements, but building a high-performance hardware platform is tricky. It involves making the right hardware choices and understanding software frameworks, algorithms, and how they interact. Mike Pittaro shares the secrets of matching the right hardware and tools to the right algorithms for optimal performance.

Adrian Popescu is a data engineer at Unravel Data Systems, where he works on performance profiling and optimization of Spark applications. He has more than eight years of experience building and profiling data management applications. Adrian holds a PhD in computer science from EPFL, where his thesis focused on modeling the runtime performance of a class of analytical workloads, including iterative tasks executing on in-memory graph processing engines (Giraph BSP) and SQL queries executing at scale on Hive; a master of applied science from the University of Toronto; and a bachelor of science from University Politehnica, Bucharest.

Presentations

Using ML to solve failure problems with ML and AI apps in Spark Session

A roadblock to the agility that comes with Spark is that application developers can get stuck on application failures and have a tough time finding and resolving the issue. Adrian Popescu and Shivnath Babu explain how to use the root cause diagnosis algorithm and methodology to solve failure problems with ML and AI apps in Spark.

Sean Power is the founder of Repable.

Presentations

Data science and e-sports DCS

Data case study with Sean Power

Syed Rafice is a senior system engineer at Cloudera, where he specializes in big data on Hadoop technologies and is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed also focuses on both platform and cybersecurity. He has worked across multiple sectors, including government, telecoms, media, utilities, financial services, and transport.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over 20 years, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and, most recently, product management, giving him a holistic view of and expertise in the database market. Previously, Greg was part of the esteemed Real-World Performance group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Rethinking data marts in the cloud: Common architectural patterns for analytics Session

Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from UW Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Exactly once, more than once: Apache Kafka, Heron, and Apache Apex Session

In a series of three 11-minute presentations, key members of Apache Kafka, Heron, and Apache Apex discuss their respective implementations of exactly once delivery and semantics.

Low-latency streaming: Twitter Heron on Infiniband Session

Modern enterprises are data driven and want to move at light speed. To achieve real-time performance, financial applications use streaming infrastructures for low latency and high throughput. Twitter Heron is an open source streaming engine with low latency around 14 ms. Karthik Ramasamy and Supun Kamburugamuvee explain how they ported Heron to Infiniband to achieve latencies as low as 7 ms.

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and insights on how to address them.

Twitter Heron goes exactly once Session

Twitter processes billions of events per day at the instant the data is generated. To achieve real-time performance, Twitter employs Heron, an open source streaming engine tailored for large-scale environments. Karthik Ramasamy presents the techniques Heron uses to implement exactly once semantics and shares operating experiences at scale.

Jun Rao is the cofounder of Confluent, a company that provides a stream data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM’s Almaden research data center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer of Apache Cassandra.

Presentations

A deep dive into Apache Kafka core internals Session

Over the last few years, streaming platform Apache Kafka has been used extensively for real-time data collecting, delivering, and processing—particularly in the enterprise. Jun Rao leads a deep dive on some of the key internals that help make Kafka popular and provide strong reliability guarantees.

Exactly once, more than once: Apache Kafka, Heron, and Apache Apex Session

In a series of three 11-minute presentations, key members of Apache Kafka, Heron, and Apache Apex discuss their respective implementations of exactly once delivery and semantics.

Introducing exactly once semantics in Apache Kafka Session

Apache Kafka’s rise in popularity as a streaming platform has demanded a revisit of its traditional at least once message delivery semantics. Jun Rao presents the recent additions to Apache Kafka that achieve exactly once semantics and discusses the newly introduced transactional APIs, using the Kafka Streams API as an example to show how they are leveraged for stream processing tasks.
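For readers who want a concrete feel for the new APIs before the session, here is a minimal sketch of a transactional producer using the confluent-kafka Python client; the broker address, topic names, and transactional ID are hypothetical, and the session itself centers on the Java clients and Kafka Streams.

```python
# Minimal sketch of Kafka's transactional producer API via the
# confluent-kafka Python client. Broker, topics, and the transactional
# ID below are hypothetical; error handling is elided.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "payments-processor-1",  # stable ID enables zombie fencing
})

producer.init_transactions()       # register the transactional ID with the broker
producer.begin_transaction()
try:
    producer.produce("payments", key=b"order-42", value=b"charged")
    producer.produce("payments-audit", key=b"order-42", value=b"logged")
    producer.commit_transaction()  # both records become visible atomically
except Exception:
    producer.abort_transaction()   # neither record is exposed to consumers
```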

Sneha Rao is an experienced product owner at Spotify, where she works with big data at scale. Previously, Sneha worked at the New York Times, Comcast/NBCUniversal, and NASA’s data center. She is skilled in database management, big data, analytics, and Python and is currently pursuing an MBA focused on innovation, design, and entrepreneurial studies at New York University’s Leonard N. Stern School of Business.

Presentations

Managing core data entities for internal customers at Spotify Session

Spotify makes data-driven product decisions. As the company grows, the magnitude and complexity of the data it cares most about are rapidly increasing. Sneha Rao and Joel Östlund walk you through how Spotify stores and exposes audience data created by multiple internal producers within Spotify.

Faraz Rasheed is the senior manager of the big data analytics team at TD Bank, Canada, where he leads a team of data scientists and data engineers empowering TD Bank with advanced analytics. He also teaches data science-related courses at Ryerson University. Previously, Faraz was a senior data scientist at BlackBerry, where he led the data science team. Faraz has a special interest in modernizing data science project development with innovative and impactful solutions that entrench data science practices within business teams. Faraz holds a PhD in machine learning from the University of Calgary.

Presentations

Griffin: Fast-tracking model development in Hadoop Session

Steven Totman and Faraz Rasheed offer an overview of Griffin, a high-level, easy-to-use framework built on top of Spark, which encapsulates the complexities of common model development tasks within four phases: data understanding, feature extraction, model development, and serving modeling results.

Pranav Rastogi is a program manager on Microsoft’s Azure HDInsight team. Pranav spends most of his time making it easier for customers to leverage the big data ecosystem to build big data solutions faster.

Presentations

Building big data applications on Azure Tutorial

As big data solutions are rapidly moving to the cloud, it's becoming increasingly important to know how to use Apache Hadoop, Spark, R Server, and other open source technologies in the cloud. Pranav Rastogi walks you through building big data applications on Azure HDInsight and other Azure services.

Alex Ratner is a third-year PhD student at the Stanford InfoLab working under Chris Re. Alex works on new machine learning paradigms for settings where limited or no hand-labeled training data is available, motivated in particular by information extraction problems in domains like genomics, clinical diagnostics, and political science. He coleads the development of the Snorkel framework for lightweight information extraction.

Presentations

Data programming: Creating large training sets quickly HDS

As data-hungry algorithms become the norm in machine learning, the bottleneck is now acquiring labeled training data. Alex Ratner explores data programming, a paradigm for the programmatic creation of training sets in which users express weak supervision strategies or domain heuristics as simple scripts called labeling functions, which are then automatically denoised.
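To make the idea concrete, a labeling function is just a small script that votes on a label or abstains; the toy spam-detection functions below are hypothetical illustrations, not the session's code, and a denoising model would later combine their noisy votes.

```python
# Toy illustration of data programming's labeling functions (hypothetical
# example). Each function votes SPAM, HAM, or abstains; downstream, a
# generative model denoises and combines the votes into training labels.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_mentions_wire_transfer(text):
    return SPAM if "wire transfer" in text.lower() else ABSTAIN

def lf_short_reply(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

def label_matrix(texts, lfs):
    """Apply every labeling function to every example."""
    return [[lf(t) for lf in lfs] for t in texts]

votes = label_matrix(
    ["thanks, see you soon", "Claim your prize via wire transfer http://x.co"],
    [lf_contains_link, lf_mentions_wire_transfer, lf_short_reply],
)
print(votes)  # [[-1, -1, 0], [1, 1, -1]]
```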

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed, robust cloud applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. Radhika enjoys spending time with her family, walking her dog, doing Warrior X-Fit, and playing an occasional hand at Smash Bros.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Ryan Nienhuis, Radhika Ravirala, and Dario Rivera walk you through building a big data application using a combination of open source technologies and AWS managed services. Along the way, they also cover architecture design patterns and best practices for big data applications.

As CIBC’s chief data officer (CDO), Jose Ribau leads the bank’s data management strategy and advanced analytics functions. Jose’s team is responsible for governing the use of strategic data and driving transformation of the business through the delivery of client segmentation, predictive modeling, and analytics projects, all with a focus on producing insights that help capture growth through product consolidation and increased share of wallet.

Prior to this role, Jose worked in Client Analytics and Product Development at CIBC where he contributed to Canada’s first Visa Debit launch and other client-focused solutions that paved the way for additional revenue streams. Early in his career, Jose spent several years as a researcher at McMaster University’s Medical Sciences division where he completed his Master of Science degree. His experience wrangling clinical data sets has prepared him well for leading CIBC’s data science and advanced analytics practice.

Jose also holds an MBA from Queen’s University and a BSc from Wilfrid Laurier University. He enjoys spending quality time with his family, is an avid cyclist, and is a big Star Wars fan.

Presentations

Fintech, data innovation, and the real world Findata

Jose Ribau discusses the pragmatic side of data-driven finance—the realities of modern banking—comparing the demands of governance and compliance to the aspirations of fintech startups.

Salema Rice is the chief data officer at Allegis Group, where she is responsible for enterprise-wide data and analytics, including data management technology, big data, enterprise data operations, global master data management, enterprise data governance, business intelligence, enterprise information management, insights, and analytics.

Presentations

Differentiating ourselves with data and analytics DCS

Salema Rice shares how Allegis Group, the largest privately held talent management company in the world, is transforming into a digitally enhanced company by using big data and data sciences to differentiate itself in the marketplace.

Dario Rivera is a solutions architect at Amazon Web Services, where he helps customers get the most out of AWS. A 20-year IT veteran, Dario has also worked widely within the public sector, holding positions within the DOD, FBI, DHS, and DEA. From highly available, scalable, and elastic architectures to complex enterprise systems with zero-downtime availability, Dario is always on the lookout for a challenge to change the world through customer success. Dario has presented at conferences and venues around the world, including Re:Invent, Strata + Hadoop World, HIMMS, and Oxford University.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Ryan Nienhuis, Radhika Ravirala, and Dario Rivera walk you through building a big data application using a combination of open source technologies and AWS managed services. Along the way, they also cover architecture design patterns and best practices for big data applications.

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.

Presentations

Rethinking data marts in the cloud: Common architectural patterns for analytics Session

Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.

Matthew Roche is a senior program manager in Microsoft’s Cloud and Enterprise group, where he focuses on enterprise information management, crowdsourced metadata, and data source discovery. Matthew currently delivers capabilities in Azure Data Catalog; previously, he worked on Power BI, SQL Server Integration Services, Master Data Services, and Data Quality Services. When not enabling the world to get more value from its data, he enjoys reading, baking, and competitive longsword combat.

Presentations

Building a Rosetta Stone for business data Session

The data-driven business must bridge the language gap between data scientists and business users. Matthew Roche and Jennifer Stevens walk you through building a business glossary that codifies your semantic layer and enables greater conversational fluency between business users and data scientists.

Matthew Rocklin is an open source software developer at Continuum Analytics focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.

Presentations

Dask: Flexible parallelism in Python for advanced analytics Session

Dask parallelizes Python libraries like NumPy, pandas, and scikit-learn, bringing a popular data science stack to the world of distributed computing. Matthew Rocklin discusses the architecture and current applications of Dask in the wild and explores computational task scheduling and parallel computing within Python generally.
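For a flavor of what parallelizing pandas looks like in practice, here is a minimal sketch; the file pattern and column names are hypothetical.

```python
# Minimal Dask sketch: the pandas-like API builds a lazy task graph over
# many partitions, and .compute() executes it in parallel. The file
# pattern and column names are hypothetical.
import dask.dataframe as dd

df = dd.read_csv("events-*.csv")               # many files, one logical dataframe
spend = df.groupby("user_id")["spend"].mean()  # no work happens yet
print(spend.compute())                         # runs the graph on all cores
```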

Scaling Python data analysis Tutorial

The Python data science stack, which includes NumPy, pandas, and scikit-learn, is efficient and intuitive but only for in-memory data and a single core. Matthew Rocklin and Ben Zaitlen demonstrate how to parallelize and scale your Python workloads to multicore machines and multimachine clusters.
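The same code can move from a laptop to a cluster by pointing a dask.distributed client at a scheduler; in the minimal sketch below, the scheduler address is hypothetical, and Client() with no arguments falls back to a local cluster using all available cores.

```python
# Sketch of scaling a Dask array computation with dask.distributed.
# Client() with no arguments starts a local cluster; the commented-out
# scheduler address stands in for a hypothetical remote cluster.
from dask.distributed import Client
import dask.array as da

client = Client()  # or Client("tcp://scheduler-host:8786")

x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = (x + x.T).mean(axis=0).compute()  # executed across the workers
print(result[:5])
```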

Julie Rodriguez is vice president of product management and user experience at Eagle Investment Systems. An experience designer focusing on user research, analysis, and design for complex systems, Julie has patented her work in data visualizations for MATLAB, compiled a data visualization pattern library, and publishes industry articles on user experience and data analysis and visualization. She is the coauthor of Visualizing Financial Data, a book about visualization techniques and design principles that includes over 250 visuals depicting quantitative data.

Presentations

Data visualizations decoded Data 101

Designing data visualizations presents unique and interesting challenges: how to tell a compelling story, how to deliver important information in a forthright, clear format, and how to make visualizations beautiful and engaging. Julie Rodriguez shares a few disruptive designs and connects them back to Vizipedia, her compiled data visualization library.

Expanding data literacy with data visualizations Session

While the value of data and its role in informing decisions and communications is well known, its meaning can be incorrectly interpreted without data visualizations that provide context and accurate representation of the underlying numbers. Julie Rodriguez shares new approaches and visual design methods that provide a greater perspective of the data.

Steve Ross is a senior product manager at Cloudera, where he focuses on product management for security across the Hadoop ecosystem, championing the interests of users and IT teams working to get the most out of Hadoop while complying with the demands of information security and compliance requirements. Previously, at RSA Security and Voltage Security, Steve managed product portfolios now in use by the largest global companies and over a hundred million users.

Presentations

GDPR: Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steven Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Edgar Ruiz is a solutions engineer at RStudio with a background in deploying enterprise reporting and business intelligence solutions. He is the author of multiple articles and blog posts sharing insights on analytics and server infrastructure for data science. Recently, Edgar authored the “Data Science on Spark using sparklyr” cheat sheet.

Presentations

Using R and Spark to analyze data on Amazon S3 Session

With R and sparklyr, a Spark standalone cluster can be used to analyze large datasets found in S3 buckets. Edgar Ruiz walks you through setting up a Spark standalone cluster on EC2, organizing S3 bucket folders and files, connecting R to Spark, configuring the settings needed to read S3 data into Spark, and an approach to importing and wrangling the data.

Philip Russom is the research director for data management at TDWI, where, as an industry analyst, he oversees many of the company’s research-oriented publications, services, and events. A well-known figure in data warehousing, business intelligence, data management, and analytics, Philip has published 550+ research reports, magazine articles, opinion columns, speeches, and webinars. Previously, he was an industry analyst covering BI at Forrester Research and Giga Information Group; ran his own business as an independent industry analyst and BI consultant and was a contributing editor for leading IT magazines; and held technical and marketing positions for various database vendors.

Presentations

The data lake: Improving the role of Hadoop in data-driven business management Session

Philip Russom explains how a data lake can improve the role of Hadoop in data-driven business management. With the right end-user tools, a data lake can enable self-service data practices that wring business value from big data and modernize and extend programs for data warehousing, analytics, data integration, and other data-driven solutions.

Neelesh Srinivas Salian is a software engineer on the Data Platform Infrastructure team in the Algorithms group at Stitch Fix, where he helps build the ecosystem around Apache Spark. Previously, he worked at Cloudera with Apache projects like YARN, Spark, and Kafka. He holds a master’s degree in computer science with a focus on cloud computing from North Carolina State University and a bachelor’s degree in computer engineering from the University of Mumbai, India.

Presentations

Apache Spark in the hands of data scientists Session

Neelesh Srinivas Salian offers an overview of the data platform used by data scientists at Stitch Fix, based on the Spark ecosystem. Neelesh explains the development process and shares some lessons learned along the way.

Majken Sander is a data nerd, business analyst, and solution architect at TimeXtender. Majken has worked with IT, management information, analytics, BI, and DW for 20+ years. Armed with strong analytical expertise, she is keen on “data driven” as a business principle, data science, the IoT, and all other things data.

Presentations

Show me my data, and I’ll tell you who I am. Session

Personal data is increasingly spread across various services globally. But what do companies know about me? And how do we collect that knowledge, get ahold of our own data, and maybe even correct faulty perceptions by putting the right answers out there as a service? Majken Sander explains why we desperately need a personal Discovery Hub: a go-to place for knowledge about ourselves.

Sri Satish is cofounder and CEO of H2O.ai, the builders of H2O. H2O democratizes big data science and makes Hadoop do math for better predictions. Previously, Sri spent time scaling R over big data with researchers at Purdue and Stanford; cofounded Platfora; was the director of engineering at DataStax; served as a partner and performance engineer at the Java multicore startup Azul Systems, where he tinkered with the entire ecosystem of enterprise apps at scale; and worked on a NoSQL trie-based index for semistructured data at in-memory index startup RightOrder. Sri is known for his knack for envisioning killer apps in quickly evolving spaces and assembling stellar teams to productize that vision. He is a regular speaker on the big data, NoSQL, and Java circuit and leaves a trail at @srisatish.

Presentations

Interpretable AI: Not just for regulators Session

Interpreting deep learning and machine learning models is not just another regulatory burden to be overcome. People who use these technologies have the right to trust and understand AI. Patrick Hall and Sri Satish share techniques for interpreting deep learning and machine learning models and telling stories from their results.

Andrei Savu is a software engineer at Cloudera, where he’s working on Cloudera Director, a product that makes Hadoop deployments in cloud environments easy and more reliable for customers.

Presentations

A deep dive into running data engineering workloads in AWS Tutorial

Jennifer Wu, Andrei Savu, Vinithra Varadharajan, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

Jared Schiffman is the founder of Perch Interactive Inc, a startup intent on revolutionizing the retail environment. Jared has worked at the intersection of design, computer science, and education for over two decades. His work fuses the physical world with the digital world and plays with the relationship between the two; his projects are steeped in metaphor and gesture and emphasize the power of direct experience. Jared is the cofounder of Potion, an interactive design and technology firm located in New York City that Fast Company named one of the top 10 most innovative design companies in 2010, and has taught courses at Parsons School of Design, New York University, and the Gates-funded High Tech High in San Diego. Jared holds a master’s degree in media arts and science from the MIT Media Lab, where he studied with John Maeda in the Aesthetics and Computation Group, and an SB in computer science and engineering from MIT.

Presentations

Retail's panacea: How machine learning is driving product development Session

In a panel discussion, Karen Moon, Jared Schiffman, and Marta Jamrozik explore how the retail industry is embracing data to include consumers in the design and development process, tackling the challenges associated with the wealth of sources and the unstructured nature of the data they handle and process and how the data is turned into insights that are digestible and actionable.

William Schmarzo is the CTO of Dell EMC’s Big Data practice, where he is responsible for setting the strategy and defining the service line offerings and capabilities for the EMC Consulting Enterprise Information Management and Analytics service line. Bill has more than two decades of experience in data warehousing, BI, and analytics applications. He authored the Business Benefits Analysis methodology that links an organization’s strategic business initiatives with its supporting data and analytic requirements and has served on the Data Warehouse Institute’s faculty as the head of the analytic applications curriculum. Previously, Bill was the vice president of analytics at Yahoo, where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of actionable insights through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing, and sales of their industry-defining analytic applications. Bill is the author of Big Data: Understanding How Data Powers Big Business, has written several whitepapers, and coauthored a series of articles on analytic applications with Ralph Kimball. He is a frequent speaker on the use of big data and advanced analytics to power an organization’s key business initiatives. Bill holds a master’s degree in business administration from the University of Iowa and a bachelor of science degree in mathematics, computer science, and business administration from Coe College.

Presentations

Executive Briefing: Determining the economic value of your data (EvD) Session

Organizations need a process and supporting frameworks to become more effective at leveraging data and analytics to transform their business models. Using the Big Data Business Model Maturity Index as a guide, William Schmarzo demonstrates how to assess business value and implementation feasibility with respect to the monetization potential of an organization’s business use cases.

Jacob Schreiber is a third-year CSE PhD student and IGERT big data fellow at the University of Washington. Jacob is a core developer of the popular Python machine learning package scikit-learn and the author of pomegranate, a probabilistic modeling package for Python.

Presentations

Pomegranate: Flexible probabilistic modeling for Python HDS

Jacob Schreiber offers an overview of pomegranate, a flexible probabilistic modeling package implemented in Cython for speed. Jacob explores the models it supports, such as Bayesian networks and hidden Markov models, and how to easily implement them and explains how the underlying modular implementation unlocks several benefits for the modern data scientist.
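As a taste of the package, the sketch below fits a two-state Gaussian hidden Markov model directly from raw sequences using pomegranate's classic (pre-1.0) API; the data are made up.

```python
# Minimal pomegranate sketch (classic pre-1.0 API; data are made up):
# learn a two-state Gaussian HMM from raw sequences in one call.
from pomegranate import HiddenMarkovModel, NormalDistribution

sequences = [[1.0, 1.2, 5.1, 5.3, 5.0], [0.9, 1.1, 1.0, 4.8, 5.2]]
model = HiddenMarkovModel.from_samples(
    NormalDistribution, n_components=2, X=sequences
)
print(model.predict(sequences[0]))  # most likely hidden state at each step
```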

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for six years.

Presentations

How to leverage the cloud for business solutions Data 101

The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company or for your use cases is critical to ensuring success. Jim Scott shares use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to move your data between locations.

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Ted Malaska, Gwen Shapira, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

Nick Selby is a Texas police detective focused on investigating computer fraud and child exploitation and a cybersecurity incident responder. A frequent contributor to newspapers including the Washington Post and New York Times, Nick is also the coauthor of Cyber Survival Manual: From Identity Theft to The Digital Apocalypse and Everything in Between, In Context: Understanding Police Killings of Unarmed Civilians, and Blackhatonomics: Understanding the Economics of Cybercrime and the technical editor of Investigating Internet Crimes: An Introduction to Solving Crimes in Cyberspace.

Presentations

The context of contacts: Seeking root causes of racial disparity in Texas traffic-summons fines DCS

Nick Selby offers an overview of his study on traffic-stop data in Texas, which found evidence that the state targeted low-income residents (a disproportional number of whom are black and Latino) for heightened scrutiny and penalties. The problem is not necessarily an issue of racist cops—which means fixing the criminal justice system isn’t just an issue of addressing racism in uniform.

Viral Shah is the cofounder and CEO of Julia Computing and a cocreator of the Julia language, as well as other open source software. Previously, he drove the rearchitecting of the government’s social security systems in India as part of the national ID project, Aadhaar. Viral is the coauthor of Rebooting India.

Presentations

Julia and Spark, better together Session

Spark is a fast and general engine for large-scale data. Julia is a fast and general engine for large-scale compute. Viral Shah explains how combining Julia's compute and Spark's data processing capabilities makes amazing things possible.

Shaked Shammah is a graduate student at The Hebrew University working under Shai Shalev-Shwartz and a researcher at Mobileye Research. Shaked’s work focuses on general machine learning and optimization, specifically the theory and practice of deep learning and reinforcement learning.

Presentations

Failures of gradient-based deep learning HDS

Deep learning is amazing, but it sometimes fails miserably, even for very simple, practical problems. Shaked Shammah discusses four types of common problems in which deep learning fails. Some can be solved by using specific approaches to network architecture and loss functions. For others, deep learning is simply not the right way to go.

Tushar Shanbhag is head of data strategy and data products at LinkedIn. Tushar is a seasoned executive with a track record of building high-growth businesses at market-defining companies such as LinkedIn, Cloudera, VMware, and Microsoft. Most recently, Tushar was vice president of products and design at Arimo, an Andreessen Horowitz-backed company building data intelligence products using analytics and AI.

Presentations

Taming the ever-evolving compliance beast: Lessons learned at LinkedIn Session

Shirshanka Das and Tushar Shanbhag explore the big data ecosystem at LinkedIn and share its journey to preserve member privacy while providing data democracy. Shirshanka and Tushar focus on three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement platform, and a unified data access layer.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Ted Malaska, Gwen Shapira, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

One cluster does not fit all: Architecture patterns for multicluster Apache Kafka deployments Session

There are many good reasons to run more than one Kafka cluster…and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions to help you better choose the right architecture for your needs.

The three realities of modern programming: The cloud, microservices, and the explosion of data Session

Gwen Shapira explains how the three realities of modern programming—the explosion of data and data systems, building business processes as microservices instead of monolithic applications, and the rise of the public cloud—affect how developers and companies operate today and why companies across all industries are turning to streaming data and Apache Kafka for mission-critical applications.

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions and expertise ranging from development to production deployment in a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. He has held technology leadership positions for NetApp, Fujitsu, and others. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes. He holds two patents.

Presentations

Operationalizing your data lake: The key to a scalable, modern data architecture (sponsored by Zaloni) Session

With the majority of Hadoop implementations failing to deliver business value today, it is critical to add a big data management platform to operationalize, automate, and optimize your data lake for success. Ben Sharma shares lessons learned from the field, a reference architecture, and a set of considerations for management, operations, security, and governance.

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Unraveling data with Spark using deep learning and other algorithms from machine learning Tutorial

Vartika Singh and Jeffrey Shmain walk you through various approaches using the machine learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Vartika and Jeff also demonstrate how to leverage open source deep learning frameworks to run classification problems on image and text datasets leveraging Spark.

Dave Shuman is a subject-matter expert at Cloudera. Dave has an extensive background in business intelligence applications, database architecture, logical and physical database design, and data warehousing. Previously, Dave held a number of roles at Vision Chain, a leading demand signal repository provider enabling retailer and manufacturer collaboration, including chief operations officer, vice president of field operations responsible for customer success and user adoption, vice president of product responsible for product strategy and messaging, and director of services. He also served top consumer goods companies such as Kraft Foods, PepsiCo, and General Mills, where he was responsible for implementations. Earlier, Dave was vice president of operations for enews, an ecommerce company acquired by Barnes and Noble, and executive vice president of management information systems, where he managed software development, operations, and retail analytics; developed ecommerce applications and business processes used by Barnesandnoble.com, Yahoo, and Excite; and pioneered an innovative process for affiliate commerce. He holds an MBA with a concentration in information systems from Temple University and a BA from Earlham College.

Presentations

An open source architecture for the IoT Session

Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. Dave Shuman and James Kirkland showcase an end-to-end architecture for the IoT based on open source standards, highlighting Eclipse Kura, an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.

Tanvi Singh is the chief analytics officer, CCRO, at Credit Suisse, where she leads a team of 15+ data scientists and analytics SMEs globally in Zurich, New York, London, and Singapore that is responsible for delivering multimillion-dollar projects in big data with leading Silicon Valley vendors in the space of regulatory technology (regtech). Tanvi has 18 years of experience managing big data analytics, SAP business intelligence, data warehousing, digital analytics, and Siebel CRM platforms, with a focus on statistics, machine learning, text mining, and visualizations. She also has experience in quality as a Lean Six Sigma Black Belt. Tanvi holds a master’s degree in software systems from the University of Zurich.

Presentations

Findata session with Tanvi Singh Findata

Details to come.

Keynote with Tanvi Singh Keynote

Details to come.

Vartika Singh is a solutions architect at Cloudera with over 10 years of experience applying machine learning techniques to big data problems.

Presentations

Securely building deep learning models for digital health data Tutorial

Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.

Unraveling data with Spark using deep learning and other algorithms from machine learning Tutorial

Vartika Singh and Jeffrey Shmain walk you through various approaches using the machine learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Vartika and Jeff also demonstrate how to leverage open source deep learning frameworks to run classification problems on image and text datasets leveraging Spark.

Ben Snively is a specialist solutions architect on the Amazon Web Services Public Sector team, where he specializes in big data, analytics, and search. Previously, Ben was an engineer and architect on DOD contracts, where he worked with Hadoop and big data solutions. He has over 11 years of experience creating analytical systems. Ben holds bachelor’s and master’s degrees in computer science from the Georgia Institute of Technology and a master’s in computer engineering from the University of Central Florida.

Presentations

Serverless big data architectures: Design patterns and best practices (sponsored by AWS) Session

How do you incorporate serverless concepts and technologies into your big data architectures? Ben Snively shares use cases, best practices, and a reference architecture to help you streamline data processing and improve analytics through a combination of cloud and open source serverless technologies.

Siew Choo joined DBS Bank in November 2014 as managing director and head of core systems technology. She is a member of the DBS Singapore Management Committee.

In this capacity, she is responsible for driving strategy for enterprise-wide technology solutions for core banking, data analytics, finance, risk, compliance, and audit across the 16 countries in which DBS operates, and for driving the bank’s technology transformation agenda, including the use of cloud, big data, machine learning, and Agile methods. Siew Choo has led the insourcing of all technology application teams and the transformation of the legacy application stack to modern technology. In 2017, she was mandated to build the enterprise data platform as part of the bank’s agenda to become a data-driven organization, in addition to the merger integration of DBS’s acquisition of ANZ’s retail and wealth businesses in five countries in Asia.

Prior to joining DBS Bank, Siew Choo spent 19 years at JPMorgan, most recently as head of technology for Asia banking. During this period, she headed the Asia equities and Asia transaction banking technology teams between 2002 and 2014, based in Japan and Hong Kong, and led multiyear technology build-outs to support JPM’s aggressive business expansion for equities and transaction banking across Asia.

Siew Choo was an ASEAN pre-university scholar (awarded by the Public Service Commission of Singapore) and attended Victoria Junior College, Singapore, in 1988-1989. She graduated with first-class honors in computer science, as the top honors student, from the National University of Singapore in 1994 and obtained her MBA from the National University of Singapore in 1999. Siew Choo speaks English, Malay, and conversational Chinese.

Presentations

Executive panel: Big data and the cloud Down Under Session

Major companies in Australia and New Zealand, including Air New Zealand, Westpac, and ANZ, have been pioneering the adoption of big data technologies like Hadoop. In a panel moderated by Steve Totman, senior execs from these companies share use cases, challenges, and how to be successful Down Under, on the opposite side of the world from where technologies like Hadoop got started.

Audrey Spencer-Alvarado is a business analyst for the Portland Trail Blazers. Audrey and the other members of the business analytics team provide all data insights to the various decision makers at the Trail Blazers and affiliates. She also leads Tableau reporting and statistical modeling projects.

Presentations

How the Portland Trail Blazers increase conversion rates with Azure Machine Learning DCS

Professional sports teams generally have very large fan bases, but only a small percentage of fans attend multiple games or purchase season tickets each year. Audrey Spencer-Alvarado explains how better identification of customers enables the Portland Trail Blazers to conduct more targeted campaigns leading to a higher conversion rate, increased revenue, and an improved customer experience.

Jessica Stauth is managing director of research at Quantopian, a crowdsourced quantitative investment firm, where she and her team are responsible for selecting algorithms from the Quantopian community for the company’s portfolio. Previously, Jessica was an equity quant analyst at the StarMine Corporation and director of quant product strategy for Thomson Reuters.

Presentations

Findata session with Jessica Stauth Findata

Findata session with Jessica Stauth

Jennifer Marie Stevens is a principal program manager with Microsoft Azure, where she oversees Microsoft’s approach to metadata management. A constant learner, Jennifer has spent her career taking on new disciplines, including product management, product marketing, engineering, and even a stint speechwriting for Microsoft’s top executives. 

Presentations

Building a Rosetta Stone for business data Session

The data-driven business must bridge the language gap between data scientists and business users. Matthew Roche and Jennifer Stevens walk you through building a business glossary that codifies your semantic layer and enables greater conversational fluency between business users and data scientists.

Bargava Subramanian is a senior data scientist at Red Hat, based in Bangalore, India. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

AI-driven next-generation developer tools Session

Bargava Subramanian and Harjinder Mistry explain how machine learning and deep learning techniques are helping Red Hat build smart developer tools that make software developers more efficient.

Jagane Sundar is the CTO at WANdisco. Jagane has extensive big data, cloud, virtualization, and networking experience. He joined WANdisco through its acquisition of AltoStor, a Hadoop-as-a-service platform company. Previously, Jagane was founder and CEO of AltoScale, a Hadoop- and HBase-as-a-platform company acquired by VertiCloud. His experience with Hadoop began as director of Hadoop performance and operability at Yahoo. Jagane’s accomplishments include creating Livebackup, an open source project for KVM VM backup, developing a user mode TCP stack for Precision I/O, developing the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun Microsystems, and creating and selling a 32-bit VxD-based TCP stack for Windows 3.1 to NCD Corporation for inclusion in PC-Xware. Jagane is currently a member of the technical advisory board of VertiCloud. He holds a BE in electronics and communications engineering from Anna University.

Presentations

Active replication of Hive and other Hadoop data (sponsored by WANdisco) Session

Jagane Sundar explains how to meet your enterprise SLAs while making full use of resources with patented Active Data Replication technology, a capability long considered impossible in distributed systems.

Sahaana Suri is a second-year PhD student in the Stanford InfoLab, working with Peter Bailis. Sahaana’s research focuses on building easily accessible data analytics and machine learning systems that scale. She holds a bachelor’s degree in electrical engineering and computer science from the University of California, Berkeley.

Presentations

MacroBase: A search engine for fast data streams Session

Sahaana Suri offers an overview of MacroBase, a new analytics engine from Stanford designed to prioritize the scarcest resource in large-scale, fast-moving data streams: human attention. MacroBase allows reconfigurable, real-time root-cause analyses that have already diagnosed issues in production streams in mobile, data center, and industrial applications.

Inbal Tadeski is a data scientist at Anodot, a provider of real-time machine learning anomaly detection and analytics solutions for detection of business incidents. Previously, Inbal was a research engineer at HP Labs, where she specialized in machine learning and data mining. She holds an MSc in computer science with a focus on machine learning from Hebrew University in Jerusalem and a BSc in computer science from Ben Gurion University.

Presentations

A spike in sales is not always good news: On the importance of learning the relationships between time series metrics at scale HDS

Inbal Tadeski demonstrates the importance of identifying relationships between time series metrics so that they can be used for predictions, root cause diagnosis, and more. Inbal shares accurate methods that work at large scale, such as behavioral pattern similarity clustering algorithms, along with strategies for reducing false positives, false negatives, and computational cost and for distinguishing correlation from causation.

David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Natural language understanding at scale with spaCy, Spark ML, and TensorFlow Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, TensorFlow for training custom machine-learned annotators, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
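As a small taste of the first ingredient, the sketch below runs a spaCy pipeline over a sentence and reads off token-level and entity-level annotations; it assumes the small English model has been installed (python -m spacy download en_core_web_sm), and the sentence is made up.

```python
# Minimal spaCy annotation sketch. Assumes the small English model is
# installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Patients in Boston were prescribed 20mg of atorvastatin.")

for token in doc:
    print(token.text, token.pos_, token.lemma_)  # per-token annotations
for ent in doc.ents:
    print(ent.text, ent.label_)                  # named entities
```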

When models go rogue: Hard-earned lessons about using machine learning in production Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Sean Taylor is the manager for the Bioinformatics and High Throughput Analytics team at Seattle Children’s Research Institute (SCRI), where he manages the support delivery effort for bioinformatics and computational biology solutions for the eight research centers and almost 1,000 researchers at SCRI. Sean led design and development efforts for SCRI’s integrated precision medicine repository and is now expanding the open source approaches and big data technologies to additional centers and cores. Previously, Sean led the initiative to develop and implement a state-of-the-art bioinformatics core resource at SCRI; was a computational biologist at Amgen, customizing and driving usability in a range of end user interfaces and visualization tools while applying analytic code from multiple projects for areas such as immunotherapy and inflammation; and held a postdoc at the Fred Hutchinson Cancer Research Center, where he developed a new ultrasensitive assay to detect rare mitochondrial DNA mutations in cancer and aging. Sean holds a PhD from Yale University and a BS from Brigham Young University.

Presentations

Project Rainier: Saving lives one insight at a time Session

Marc Carlson and Sean Taylor offer an overview of Project Rainier, which leverages the power of HDFS and the Hadoop and Spark ecosystem to help scientists at Seattle Children’s Research Institute quickly find new patterns and generate predictions that they can test later, accelerating important pediatric research and increasing scientific collaboration by highlighting where it is needed.

Ankit Tharwani is Proposition Manager, Information Business, Personal and Corporate Banking, at Barclays Bank PLC.

Presentations

Enabling data science self-service through an elastic data platform (sponsored by Dell EMC) Session

Barclays and Dell EMC have partnered on the deployment of a solution called the Elastic Data Platform, which gives data scientists the ability to self-serve sandbox environments, cutting the time to provision environments from months to hours.

Abraham Thomas is the cofounder and chief data officer of Quandl, a company he and cofounder Tammer Kamel created with the goal of making it easy for anyone to find and use high-quality data effectively in their professional decision making. Previously, Abraham was a portfolio manager and head of US bond trading at Simplex Asset Management, a multi-billion-dollar hedge fund group with offices in Tokyo, Hong Kong, and Princeton. He holds a degree from IIT Bombay.

Presentations

Oh buoy! How data science improves shipping intelligence for hedge funds Findata

Abraham Thomas demonstrates how maritime data can be used to predict physical commodity flows, in a case study that covers every stage of the data lifecycle, from raw data acquisition, data cleansing and structuring, and machine learning and probabilistic modeling to conversion to tractable format, packaging for final audience, and commercialization and distribution.

Alex Thomas is a data scientist at Indeed. Over his career, Alex has used natural language processing (NLP) and machine learning with clinical data, identity data, and (now) employer and jobseeker data. He has worked with Apache Spark since version 0.9, as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with spaCy, Spark ML, and TensorFlow Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, TensorFlow for training custom machine-learned annotators, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Jer Thorp is an artist and educator from Vancouver, Canada, currently living in New York. Drawing on a background in genetics, his digital art practice explores the many-folded boundaries between science, data, art, and culture. Recently, his work has been featured by The Guardian, Scientific American, The New Yorker, and Popular Science.

Thorp’s award-winning software-based work has been exhibited in Europe, Asia, North America, and South America, including at the Museum of Modern Art in Manhattan.

Jer is an adjunct professor in New York University’s ITP program and a member of the World Economic Forum’s Network on AI, IoT and the Future of Trust. From 2012 to 2017, Jer ran The Office for Creative Research, a multidisciplinary research group exploring new modes of engagement with data. From 2010 to 2012, he was the data artist in residence at the New York Times. Jer is a National Geographic Fellow and in 2015 was named one of Canada’s greatest explorers by Canadian Geographic.

Robin Thottungal is the chief data scientist and director of analytics at the Environmental Protection Agency, where he focuses on creating and implementing an agency-wide vision on analytics for effective decision making. Previously, Robin worked at Deloitte Consulting, where he focused on selling and delivering large-scale analytics projects for public-sector and commercial clients and led the global big data community of practice, developing analytical frameworks and go-to-market strategy for big data and analytics solutions, and at Johns Hopkins, where he led the development of a computational bioscience software product used for drug discovery by the pharmaceutical industry. He is the vice chair for the Washington, DC, section of the Institute of Electrical and Electronics Engineers (IEEE), the chapter chair for the IEEE Computational & Intelligence Society, and a selection panel member for the American Academy of Sciences Hellman Fellowship in Science and Technology Policy. Robin has authored seven scientific publications and presented at 20+ conferences on various aspects of data analytics. He holds a graduate degree in computational sciences from Johns Hopkins University and an undergraduate degree in computer engineering from the State University of New York. He received a National Institute of Health (NIH) predoctoral fellowship and the prestigious NIH postbaccalaureate research fellowship to do advanced computational medicine research. Outside of work, Robin enjoys spending time with his family, rock climbing, acro-yoga, running, and cooking.

Presentations

Keynote with Robin Thottungal Keynote

Keynote with Robin Thottungal

Richard Tibbetts is CEO of Empirical Systems, an MIT spinout building an AI-based data platform that provides decision support to organizations that use structured data. Previously, he was founder and CTO at StreamBase, a CEP company that merged with TIBCO in 2013, as well as visiting scientist at the Probabilistic Computing Project at MIT.

Presentations

AI for business analytics Session

Businesses have spent decades trying to make better decisions by collecting and analyzing structured data. New AI technologies are beginning to transform this process. Richard Tibbetts explores AI that guides business analysts to ask statistically sensible questions and lets junior data scientists answer questions in minutes that previously took trained statisticians hours.

Steven Totman is Cloudera’s big data subject-matter expert, helping companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Steve works with over 180 customers worldwide, across verticals, on architectures, data management tools, data models, and ethical data usage. Previously, he ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining IBM through the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents in data integration and governance- and metadata-related designs. Although he is based in NYC, Steve is happiest onsite with customers wherever they may be in the world.

Presentations

Executive panel: Big data and the cloud Down Under Session

Major companies in Australia and New Zealand, including Air New Zealand, Westpac, and ANZ, have been pioneering the adoption of big data technologies like Hadoop. In a panel moderated by Steve Totman, senior execs from these companies share use cases, challenges, and how to be successful Down Under, on the opposite side of the world from where technologies like Hadoop got started.

Griffin: Fast-tracking model development in Hadoop Session

Steven Totman and Faraz Rasheed offer an overview of Griffin, a high-level, easy-to-use framework built on top of Spark, which encapsulates the complexities of common model development tasks within four phases: data understanding, feature extraction, model development, and serving modeling results.

Associate Director, Business Intelligence & Analytics

Presentations

How Visual Analytics Drove Data Asset Success at Procter & Gamble (Sponsored by ArcadiaData) Session

The early stages of delivering on your data strategies are daunting. With many claims of failed data lakes or “data swamps,” the journey seems risky; that’s why you need help from industry experts to get going. In this talk, P&G shares its journey using big data, Apache Hadoop, and visual analytics to quickly discover new insights and optimize data models for analytics and data visualization.

DB Tsai is a senior research engineer working on personalized recommendation algorithms at Netflix. He’s also an Apache Spark committer and a member of the Apache Spark Project Management Committee (PMC). DB has implemented several algorithms in Spark, including linear regression and binary/multinomial logistic regression with elastic-net (L1/L2) regularization using the L-BFGS/OWL-QN optimizers. Previously, he was a lead machine learning engineer at Alpine Data Labs, where he led a team developing innovative large-scale distributed learning algorithms and contributed back to the open source Apache Spark project. DB was a PhD candidate in applied physics at Stanford University. He holds a master’s degree in electrical engineering from Stanford University.

Presentations

Boosting Spark MLlib performance with rich optimization algorithms Session

Recent developments in Spark MLlib have given users the power to express a wider class of ML models and decrease model training times via the use of custom parameter optimization algorithms. Seth Hendrickson and DB Tsai explain when and how to use this new API and walk you through creating your own Spark ML optimizer. Along the way, they also share performance benefits and real-world use cases.
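
For context, the elastic-net interface DB Tsai contributed to MLlib (mentioned in his bio above) already looks like the following PySpark sketch; the custom-optimizer API covered in the session goes beyond this. The data path here is illustrative:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("elastic-net-demo").getOrCreate()

    # Illustrative path; expects the standard (label, features) libsvm layout.
    training = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

    # elasticNetParam mixes the penalties: 0.0 is pure L2, 1.0 is pure L1.
    # MLlib solves this with the L-BFGS/OWL-QN optimizers under the hood.
    lr = LogisticRegression(maxIter=100, regParam=0.01, elasticNetParam=0.5)
    model = lr.fit(training)
    print(model.coefficients)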

Madeleine Udell is assistant professor of operations research and information engineering and Richard and Sybil Smith Sesquicentennial Fellow at Cornell University, where she studies optimization and machine learning for large-scale data analysis and control, with applications in marketing, demographic modeling, medical informatics, and engineering system design. Her recent work on generalized low rank models (GLRMs) extends principal components analysis (PCA) to embed tabular datasets with heterogeneous (numerical, Boolean, categorical, and ordinal) types into a low-dimensional space, providing a coherent framework for compressing, denoising, and imputing missing entries. Madeleine has developed a number of open source libraries for modeling and solving optimization problems, including Convex.jl, one of the top 10 tools in the new Julia language for technical computing, and is a member of the JuliaOpt organization, which curates high-quality optimization software. Previously, she was a postdoctoral fellow at Caltech’s Center for the Mathematics of Information, hosted by Joel Tropp. Madeleine holds a PhD in computational and mathematical engineering (under the supervision of Stephen Boyd) from Stanford University—where she was awarded an NSF graduate fellowship, a Gabilan graduate fellowship, and a Gerald J. Lieberman fellowship and was selected as the doctoral student member of Stanford’s School of Engineering Future Committee to develop a road map for the future of engineering at Stanford over the next 10–20 years—and a BS in mathematics and physics, summa cum laude with honors, from Yale University.

Presentations

Filling in missing data with generalized low-rank models HDS

Madeleine Udell explores filling in missing data with generalized low-rank models.
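
To make the idea concrete, here is a minimal numeric special case of the GLRM framework described in Madeleine’s bio, sketched in plain NumPy (this is not from her materials): alternately refit a low-rank, PCA-style approximation and use it to fill the missing entries. A full GLRM would also handle Boolean, categorical, and ordinal columns.

    import numpy as np

    def low_rank_impute(X, mask, rank=2, n_iters=100):
        """Fill entries of X where mask is False with a rank-`rank` fit."""
        filled = np.where(mask, X, X[mask].mean())  # crude initial fill
        for _ in range(n_iters):
            U, s, Vt = np.linalg.svd(filled, full_matrices=False)
            approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
            filled = np.where(mask, X, approx)  # observed entries stay fixed
        return filled

    # Toy example: a rank-1 matrix with two missing entries.
    X = np.array([[1.0, 2.0, np.nan],
                  [2.0, np.nan, 6.0],
                  [3.0, 6.0, 9.0]])
    print(low_rank_impute(X, ~np.isnan(X), rank=1))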

Michelle Ufford is a principal architect at Netflix, where she leads centralized solutions for data engineering and analytics. Michelle is currently focused on data intelligence tooling to make it easier to develop, deploy, and manage complex datasets. Previously, she led the data management team at GoDaddy, where she built data engineering solutions to support the company’s innovative advertising strategies and helped pioneer Hadoop data warehousing techniques. Michelle is a published author, patented developer, award-winning open source contributor, and Most Valuable Professional (MVP) for the Microsoft Data Platform. She’s an influencer in the data community and advises major vendors, including Microsoft, Hortonworks, and Teradata, on their big data product offerings. She blogs at Hadoopsie.com.

Presentations

Working smarter, not harder: Driving data engineering efficiency at Netflix Session

What if we used the wealth of data and experience at our disposal to drive improvements in data engineering? Michelle Ufford explains how Netflix is using data to find common patterns among the chaos that enable the company to automate repetitive and time-consuming tasks and discover ways to improve data quality, reduce costs, or quickly identify and respond to issues.

Amy Unruh is a developer programs engineer for the Google Cloud Platform, with a focus on machine learning and data analytics as well as other Cloud Platform technologies. Amy has an academic background in CS/AI and has also worked at several startups, done industrial R&D, and published a book on App Engine.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from conceptual overviews all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.
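
As a flavor of what such a walkthrough can look like, here is a hedged, minimal classifier sketch using TensorFlow’s Keras API; the tutorial’s actual examples may be structured differently, and the training arrays are hypothetical:

    import tensorflow as tf

    # A tiny dense network for 4-feature inputs and 3 classes.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_x, train_y, epochs=10)  # hypothetical NumPy arrays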

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

A deep dive into running data engineering workloads in AWS Tutorial

Jennifer Wu, Andrei Savu, Vinithra Varadharajan, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

Manuela M. Veloso is the Herbert A. Simon University Professor in the School of Computer Science at Carnegie Mellon University, where she is the head of the Machine Learning Department. Manuela’s research, undertaken with her students, focuses on artificial intelligence, particularly for a variety of autonomous robots, including mobile service robots and soccer robots. She is a fellow of the ACM, IEEE, AAAS, and AAAI and the author of numerous publications.

Presentations

Human-AI interaction: Autonomous service robots Keynote

Manuela Veloso explores human-AI collaboration, particularly robots learning from human sources and robots generating explanations in response to language-based requests about their autonomous experience. Manuela concludes with a broader discussion of human-AI interaction and the opportunities for transparency and trust building in AI systems.

Ashish Verma is a managing director at Deloitte, where he leads the Big Data and IoT Analytics practice, building offerings and accelerators to enhance business processes and effectiveness. Ashish has more than 18 years of management consulting experience helping Fortune 100 companies build solutions that address complex business problems related to realizing the value of information assets within an enterprise.

Presentations

Executive Briefing: From data insights to action—Developing a data-driven company culture Session

Ashish Verma explores the challenges organizations face after investing in hardware and software to power their analytics projects and the missteps that lead to inadequate data practices. Ashish explains how to course-correct and implement an insight-driven organization (IDO) framework that enables you to derive tangible value from your data faster.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a streaming data platform built on the Lightbend Reactive Platform, Kafka, Spark, Flink, and Mesosphere DC/OS. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He contributes to several open source projects and co-organizes conferences around the world as well as user groups in Chicago.

Presentations

Exactly once, more than once: Apache Kafka, Heron, and Apache Apex Session

In a series of three 11-minute presentations, key contributors to Apache Kafka, Heron, and Apache Apex discuss their projects’ respective implementations of exactly-once delivery and semantics.

Stream all the things! Session

While stream processing is now popular, streaming architectures must be highly reliable and scalable as never before, more like microservice architectures. Dean Wampler defines "stream" based on characteristics for such systems, using specific tools like Kafka, Spark, Flink, and Akka as examples, and argues that big data and microservices architectures are converging.

Peter Wang is the cofounder and CTO of Continuum Analytics, where he leads the product engineering team for the Anaconda platform and open source projects including Bokeh and Blaze. Peter has been developing commercial scientific computing and visualization software for over 15 years and has software design and development experience across a broad variety of areas, including 3D graphics, geophysics, financial risk modeling, large data simulation and visualization, and medical imaging. As a creator of the PyData conference, he also devotes time and energy to growing the Python data community by advocating, teaching, and speaking about Python at conferences worldwide. Peter holds a BA in physics from Cornell University.

Presentations

Data science beyond the sandbox (sponsored by Continuum Analytics) Session

Peter Wang explores the typical problems data science teams experience when working with other teams and explains how these issues can be overcome through cohesive collaborative efforts among data scientists, business analysts, IT teams, and more.

Melanie Warrick is a senior developer advocate at Google with a passion for machine learning problems at scale. Melanie’s previous experience includes work as a founding engineer on Deeplearning4j and as a data scientist and engineer at Change.org.

Presentations

What is AI? Data 101

Melanie Warrick explores the definition of artificial intelligence and seeks to clarify what AI will mean for our world. Melanie summarizes AI’s most important effects to date and demystifies the changes we’ll see in the immediate future, separating myth from realistic expectation.

Emmie Watt leads the data team at Air New Zealand, where she has established a data management framework and formed a data management solutions team that delivers data governance, information architecture, shared data solutions, and a data quality framework, enabling data-driven decision making within the organization and providing quality shared data platforms as an enabler for all digital platforms.

Presentations

Executive panel: Big data and the cloud Down Under Session

Major companies in Australia and New Zealand, including Air New Zealand, Westpac, and ANZ, have been pioneering the adoption of big data technologies like Hadoop. In a panel moderated by Steve Totman, senior execs from these companies share use cases, challenges, and how to be successful Down Under, on the opposite side of the world from where technologies like Hadoop got started.

Ryan Weil is chief scientist in the Health Products and Solutions group of Leidos. Ryan has nearly 20 years of experience in analytics and bioinformatics. Previously, he served as the program manager in support of the CDC Office of Infectious Disease’s bioinformatics and data analytics effort. Ryan holds a BS in microbiology from Texas A&M College Station and a PhD in molecular biophysics from UT Southwestern Medical Center in Dallas.

Presentations

Tracking the opioid-fueled HIV outbreak with big data (sponsored by Trifacta) Session

Ells Campbell, Connor Carreras, and Ryan Weil explain how the Microbial Transmission Network Team (MTNT) at the Centers for Disease Control (CDC) is leveraging new techniques in data collection, preparation, and visualization to advance the understanding of the spread of HIV/AIDS.

Brooke Wenig is a consultant for Databricks and a teaching associate at UCLA, where she has taught graduate machine learning, senior software engineering, and introductory programming courses. Previously, she worked at Splunk and Under Armour as a KPCB Fellow. Brooke received her MS in computer science, with highest honors, from UCLA in June 2017, focusing on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling.

Presentations

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML Tutorial

This one-day hands-on class introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.
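
A minimal sketch of the kind of Spark ML text pipeline such a class covers (the column names and DataFrames here are hypothetical, not the course’s own materials):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    # Tokenize raw text, hash tokens into term-frequency vectors, reweight
    # by inverse document frequency, then fit a classifier on top.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 18)
    idf = IDF(inputCol="rawFeatures", outputCol="features")
    lr = LogisticRegression(labelCol="label", featuresCol="features")

    pipeline = Pipeline(stages=[tokenizer, tf, idf, lr])
    # model = pipeline.fit(train_df)      # train_df has `text` and `label` columns
    # predictions = model.transform(test_df)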

Kyle Wild cofounded Keen IO in 2012 as part of the first class of Techstars Cloud. Earlier in his career, he held positions focused on product management, software engineering, game design, and distributed systems scalability. At Keen IO, Kyle sits on the board of directors and has worked in the areas of API design, brand marketing, developer community evangelism, finance, organizational design, and recruiting. He is a small-time angel investor and startup advisor and has himself spearheaded several rounds of financing from angel investors, seed funds, and large venture capital funds.

Kyle holds a BS in general engineering from the University of Illinois at Urbana-Champaign and knows quite a bit about analytics.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

Edd Wilder-James is a technology analyst, writer, and entrepreneur based in California. He’s helping transform businesses with data as VP of strategy for Silicon Valley Data Science. Formerly Edd Dumbill, Edd was the founding program chair for the O’Reilly Strata conferences and chaired the Open Source Convention for six years. He was also the founding editor of the peer-reviewed journal Big Data. A startup veteran, Edd was the founder and creator of the Expectnation conference-management system and a cofounder of the Pharmalicensing.com online intellectual-property exchange. An advocate and contributor to open source software, Edd has contributed to various projects such as Debian and GNOME and created the DOAP vocabulary for describing software projects. Edd has written four books, including O’Reilly’s Learning Rails.

Presentations

Executive Briefing: Preparing your infrastructure for AI Session

Edd Wilder-James outlines a road map for executives who are beginning to consider their strategies for implementing artificial intelligence in their critical processes.

The business case for AI, Spark, and friends Data 101

AI is white-hot at the moment, but where can it really be used? Developers are usually the first to understand why some technologies cause more excitement than others. Edd Wilder-James relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Rose Winterton is a product director at Pitney Bowes, where she leads the product direction for the company’s location intelligence products and solutions, with a recent focus on the use of spatial processing in big data environments. Rose has 15 years’ experience in location intelligence and broad hands-on customer experience in EMEA and the US, covering the telecommunications, insurance, public sector, geosciences, and retail vertical markets. Previously, she developed customer solutions as a senior consultant before moving into management. Rose studied GIS and remote sensing at University College London and geology at Oxford University.

Presentations

Benefits of big data geoenrichment for better business outcomes DCS

Geoenrichment uses a location-based key to manage data and provide a single view of a location. Rose Winterton explains how Pitney Bowes's Spectrum Technology Platform for big data allows fast processing of location-based data for address validation, geoenrichment, analysis, and integration with operational processes for more accurate decision making and better business outcomes.

Ian Wrigley is the director of education services at Confluent, where he heads the team building and delivering courses focused on Apache Kafka and its ecosystem. Over his 25-year career, Ian has taught tens of thousands of students in subjects ranging from C programming to Hadoop development and administration.

Presentations

Building real-time data pipelines with Apache Kafka Tutorial

Ian Wrigley demonstrates how Kafka Connect and Kafka Streams can be used together to build real-world, real-time streaming data pipelines. Using Kafka Connect, you'll ingest data from a relational database into Kafka topics as the data is being generated and then process and enrich the data in real time using Kafka Streams before writing it out for further analysis.
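
Kafka Connect and Kafka Streams are JVM APIs, so code from the tutorial itself would be in Java; as a rough Python analogue of the enrichment step only, here is a consume-transform-produce loop using the third-party kafka-python client, with hypothetical topic names:

    import json

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("orders-raw", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Read each raw order, derive a new field, and forward the enriched record.
    for msg in consumer:
        order = json.loads(msg.value)
        order["total_cents"] = int(round(order["price"] * order["quantity"] * 100))
        producer.send("orders-enriched", json.dumps(order).encode("utf-8"))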

Bichen Wu is a PhD student at UC Berkeley. His research focuses on deep learning, computer vision, and autonomous driving.

Presentations

Efficient neural networks for perception for autonomous vehicles HDS

Bichen Wu explores perception tasks for autonomous driving and explains how to design efficient neural networks to address critical issues such as latency, energy efficiency, and model size.

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Previously, Jennifer was a product line manager at VMware, working on the vSphere and Photon system management platforms.

Presentations

A deep dive into running data engineering workloads in AWS Tutorial

Jennifer Wu, Andrei Savu, Vinithra Varadharajan, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

How to successfully run data pipelines in the cloud Session

Data engineering is the foundational workload run prior to implementing most data analytic and operational database use cases. Jennifer Wu explores the latest technologies that deliver data engineering as a service and shares a customer case study in which this technology is integrated into a real-world data analytics pipeline.

Stephen Wu is a senior program manager for big data at Microsoft.

Presentations

Performance tuning your Hadoop/Spark clusters to use cloud storage Session

Remote storage in the cloud provides an infinitely scalable, cost-effective, and performant solution for big data customers. Adoption is rapid due to the flexibility and cost savings associated with unlimited storage capacity when separating compute and storage. Stephen Wu demonstrates how to correctly performance tune your workloads when your data is stored in remote storage in the cloud.
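
The specific knobs depend on the object store in question, but as an illustration of the style of tuning involved, here is a hedged PySpark sketch with two widely used settings for cloud object stores (not necessarily the ones Stephen covers):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("cloud-storage-tuning")
        # v2 committer avoids expensive rename-on-commit against object stores.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        # Widen the S3A connection pool for highly parallel reads (S3-backed jobs).
        .config("spark.hadoop.fs.s3a.connection.maximum", "200")
        .getOrCreate())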

Wei Yan is a senior engineer at Uber, where he builds data processing and querying systems that scale along with Uber’s hypergrowth.

Presentations

Geospatial big data analysis at Uber Session

Uber's geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao Luo and Wei Yan explain how Uber runs geospatial analysis efficiently in its big data systems, including Hadoop, Hive, and Presto.

Yan Yan is an engineer at LinkedIn, where he works on the Voldemort and Venice team within the company’s data infrastructure organization. He has extensive experience with cluster management, ZooKeeper, Helix, and distributed systems in general.

Presentations

Introducing Venice: A derived datastore for batch, streaming, and lambda architectures Session

Companies with batch and stream processing pipelines need to serve the insights they glean back to their users, an often-overlooked problem that can be hard to achieve reliably and at scale. Felix GV and Yan Yan offer an overview of Venice, a new data store capable of ingesting data from Hadoop and Kafka, merging it together, replicating it globally, and serving it online at low latency.

Fangjin Yang is a coauthor of the open source Druid project and a cofounder of Imply, a data analytics startup based in San Francisco. Previously, Fangjin held senior engineering positions at Metamarkets and Cisco Systems. Fangjin has a BASc in electrical engineering and an MASc in computer engineering from the University of Waterloo, Canada.

Presentations

Analytics at Wikipedia Session

The Wikimedia Foundation (WMF) is a nonprofit charitable organization. As the parent company of Wikipedia, one of the most visited websites in the world, WMF faces many unique challenges around its ecosystem of editors, readers, and content. Andrew Otto and Fangjin Yang explain how the WMF does analytics and offer an overview of the technology it uses to do so.

Yuhao Yang is a software engineer at Intel, where he provides implementation, consulting, and tuning advice on the Hadoop ecosystem to industry partners. Yuhao’s area of focus is distributed machine learning, especially large-scale analytical applications and infrastructure on Spark. He’s also an active contributor to Spark MLlib (50+ patches), has delivered the implementation of online LDA, QR decomposition, and several transformers of Spark feature engineering, and has provided improvements on some important algorithms.

Presentations

Building advanced analytics and deep learning on Apache Spark with BigDL Session

Yuhao Yang and Zhichao Li discuss building end-to-end analytics and deep learning applications, such as speech recognition and object detection, on top of BigDL and Spark and explore recent developments in BigDL, including Python APIs, notebook and TensorBoard support, TensorFlow model R/W support, better recurrent and recursive net support, and 3D image convolutions.

FORTHCOMING

Presentations

Changing the landscape with deep learning and accelerated analytics (Sponsored by NVIDIA) Session

More and more customers are combining the benefits of GPU-accelerated analytics (AA) and deep learning (DL) to extract AI-driven insights from their data.

Chuck Yarbrough is the VP of solutions marketing and management at Pentaho, a leading big data analytics company that helps organizations engineer big data connections, blend data, and report on and visualize all of their data. Chuck is responsible for creating and driving Pentaho solutions that leverage the Pentaho platform, enabling customers to implement big data solutions more quickly and achieve greater ROI. Chuck has more than 20 years of experience helping organizations use technology to their advantage to run, manage, and transform their business through better use of data. A lifelong participant in the data game, Chuck has held leadership roles at Deloitte Consulting, SAP Business Objects, Hyperion, and National Semiconductor.

Presentations

The Converging World of Big Data and IoT (Sponsored by Pentaho) Session

This session focuses on going from edge to outcomes, with specific blueprint examples of where IoT and big data have combined to solve significant business challenges and take advantage of business opportunities.

Lucy Yu is an engineer at MemSQL. Lucy holds a degree in computer science and a master of engineering from MIT, where, under Matei Zaharia, she worked on implementing an experimental framework for work sharing in Spark.

Presentations

Exploring real-time capabilities with Spark SQL Session

Lucy Yu demonstrates how to extend the Spark SQL abstraction to support more complex pushdown, such as group by, subqueries, and joins.
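
For background on why this matters: stock Spark pushes only projections and filters down to JDBC sources, so aggregations normally run in Spark even when the database could do them. One common workaround, sketched below with hypothetical connection details, is to hand the source a subquery as its “table” so the group by executes in the database; the session goes further by extending the abstraction itself:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

    # The subquery runs inside the database; Spark only sees the grouped result.
    agg = (spark.read.format("jdbc")
        .option("url", "jdbc:mysql://db.example.com/shop")
        .option("dbtable",
                "(SELECT user_id, SUM(amount) AS total "
                "FROM orders GROUP BY user_id) AS t")
        .option("user", "reader")
        .load())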

Matei Zaharia is an assistant professor in the computer science department at Stanford University, where he works on computer systems and big data.

Presentations

Weld: Accelerating data science by 100x Session

Modern data applications combine functions from many optimized libraries (e.g., pandas and TensorFlow) and yet do not achieve peak hardware performance due to data movement across functions. Shoumik Palkar and Matei Zaharia offer an overview of Weld, a new interface to implement functions in these libraries while enabling optimizations across them.

Ben Zaitlen is a data scientist and developer at Continuum Analytics. He has several years of experience with Python and is passionate about any and all forms of data. Currently, he spends his time thinking about usability of large data systems and infrastructure problems as they relate to data management and analysis.

Presentations

Scaling Python data analysis Tutorial

The Python data science stack, which includes NumPy, pandas, and scikit-learn, is efficient and intuitive but only for in-memory data and a single core. Matthew Rocklin and Ben Zaitlen demonstrate how to parallelize and scale your Python workloads to multicore machines and multimachine clusters.
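
One plausible shape for that scaling story, sketched here with Dask (which comes out of the same Continuum/Anaconda ecosystem, though the tutorial’s exact stack isn’t stated) and hypothetical file paths:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()  # local multicore scheduler; point at a cluster to scale out

    # One logical DataFrame backed by many partitions, with a pandas-style API.
    df = dd.read_csv("data/2017-*.csv")
    result = df.groupby("user_id")["amount"].sum()
    print(result.compute())  # triggers the parallel execution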

Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.

Presentations

Data science at team scale: Considerations for sharing, collaborating, and getting to production Session

Data science alone is easy. Data science with others, whether in the enterprise or on shared distributed systems, requires a bit more work. Tristan Zajonc and Thomas Dinsmore discuss common technology considerations and patterns for collaboration in large teams and for moving machine learning into production at scale.