Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.


Ziya Ma is the general manager of the global Big Data Technologies organization in Intel’s Software and Services group (SSG) in the System Technologies and Optimization (STO) division. Her organization focuses on optimizing big data on Intel’s platform, leading open source efforts in the Apache community, and linking innovation in industry analytics to bring about the best and most complete big data experiences. She works closely with Intel product teams, open source communities, partners from the industry, and academia to advise on implementing and optimizing the Intel platform for Hadoop or Spark ecosystems. Previously, Ziya held various management positions in Intel’s Technology Manufacturing group (TMG), where she was responsible for delivering embedded software for factory equipment, databases for manufacturing execution and process control, UI software, and more, and was product development software director of Intel IT, where she delivered software lifecycle management tools, infrastructure, and analytics solutions to Intel software teams worldwide. She also worked at Motorola earlier in her career. Ziya holds a PhD and an MS in computer science and engineering from Arizona State University.

Presentations

Accelerate analytics and AI innovations with Intel (sponsored by Intel) Keynote

Ziya Ma outlines the challenges for applying machine learning and deep learning at scale and shares solutions that Intel has enabled for customers and partners.

Vijay Srinivas Agneeswaran is director of technology at SapientNitro. Vijay has spent the last 10 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine-learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor’s degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.

Presentations

Big data computations: Comparing Apache HAWQ, Druid, and GPU databases Session

Distributed merge trees, a class of big data computations, were built to aggregate user information across multiple data sources in the media domain. Vijay Srinivas Agneeswaran explores prototypes built on top of Apache HAWQ, Druid, and Kinetica, a GPU database. Results show that Kinetica on a single G2.8x node outperformed clusters of HAWQ and Druid nodes.

Graham Ahearne is director of product management for security analytics at Corvil, where he is actively building the next generation of accelerated threat detection and investigation, powered by true real-time analysis of network data. A recognized industry expert, Graham has been advising and building information security solutions for Fortune 500 companies for over 15 years. His expertise and experience span a broad range of information security technology types, with a specialist focus on network forensics, security analytics, threat intelligence, managed services, and host-based security controls. Graham is a Certified Information Systems Security Professional (CISSP).

Presentations

Safeguarding electronic stock trading: Challenges and key lessons in network security Session

Fergal Toomey and Graham Ahearne outline the challenges facing network security in complex industries, sharing key lessons learned from their experiences safeguarding electronic trading environments to demonstrate the utility of machine learning and machine-time network data analytics.

Tyler Akidau is a staff software engineer at Google Seattle. He leads technical infrastructure’s internal data processing teams (MillWheel & Flume), is a founding member of the Apache Beam PMC, and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 Dataflow Model paper and the Streaming 101 and Streaming 102 articles on the O’Reilly website. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Meet the Expert with Tyler Akidau (Google) Meet the Experts

Chat with Tyler about stream processing in general or Apache Beam specifically.

Realizing the promise of portability with Apache Beam Session

The world of big data involves an ever-changing field of players. Much as SQL is a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. Tyler Akidau explains how this vision has been realized and discusses the challenges that lie ahead.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Scott Kurth, and Stephen O'Sullivan field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for (and the evolving role of) the CDO. Even if you don’t have a specific question, join in to hear what others are asking.

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and John Akred explain how to create a modern data strategy that powers data-driven business.

What's your data worth? Session

Valuing data can be a headache. The unique properties of data make it difficult to assess its overall value using traditional valuation approaches. John Akred discusses a number of alternative approaches to valuing data within an organization for specific purposes so that you can optimize decisions around its acquisition and management.

Antonio Alvarez is the head of data innovation at Isban UK, which aims to spearhead the transformation to a data-driven organization through digital technology. In partnership with the CDO, Antonio is creating a collaborative environment where innovative strategies and propositions around data from all sides of the business can create value for customers more quickly. Adoption has been rapid, and Santander UK is now implementing different frameworks for scaling and broadening the impact of data to disrupt the bank from the inside through a guided self-service approach. Antonio has a background in economics and 18 years of experience in financial services across four countries in business, technology, change, and data.

Presentations

Data citizenship: The next stage of data governance Session

Successful organizations are becoming increasingly Agile, and the autonomy and empowerment that Agile brings create new, active modes of engagement. Data governance, however, is still very much a centralized task that only CDOs and data owners actively care about. Antonio Alvarez and Lidia Crespo outline a more engaging and active method of data governance: data citizenship.

Anima Anandkumar is a principal scientist at Amazon Web Services. Anima is currently on leave from UC Irvine, where she is an associate professor. Her research interests are in the areas of large-scale machine learning, nonconvex optimization, and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. Previously, she was a postdoctoral researcher at MIT and a visiting researcher at Microsoft Research New England. Anima is the recipient of several awards, including the Alfred P. Sloan fellowship, the Microsoft faculty fellowship, the Google research award, the ARO and AFOSR Young Investigator awards, the NSF CAREER Award, the Early Career Excellence in Research Award at UCI, the Best Thesis Award from the ACM SIGMETRICS society, the IBM Fran Allen PhD fellowship, and several best paper awards. She has been featured in a number of forums, such as the Quora ML session, Huffington Post, Forbes, and O’Reilly Media. Anima holds a BTech in electrical engineering from IIT Madras and a PhD from Cornell University.

Presentations

Distributed deep learning on AWS using Apache MXNet Tutorial

Deep learning is the state of the art in domains such as computer vision and natural language understanding. Apache MXNet is a highly flexible and developer-friendly deep learning framework. Anima Anandkumar provides hands-on experience using Apache MXNet with preconfigured Deep Learning AMIs and CloudFormation templates to help speed your development.

Distributed deep learning on AWS using Apache MXNet Session

Anima Anandkumar demonstrates how to use preconfigured Deep Learning AMIs and CloudFormation templates on AWS to help speed up deep learning development and shares use cases in computer vision and natural language processing.

Meet the Expert with Anima Anandkumar (UC Irvine) Meet the Experts

Anima is available to discuss deep learning at scale on AWS and various AI services on AWS.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students, at companies ranging from startups to the Fortune 100, the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Meet the Expert with Jesse Anderson (Big Data Institute) Meet the Experts

Jesse will be on hand to talk about his book, Data Engineering Teams, and how data engineering teams should be set up. He also loves to talk about cutting-edge technologies such as Kafka, Spark Streaming, and Apache Beam.

The five dysfunctions of a data engineering team Session

Early project success is predicated on management making sure a data engineering team is ready and has all of the skills needed. Jesse Anderson outlines five of the most common non-technology reasons why data engineering teams fail.

André Araujo is a solutions architect with Cloudera. Previously, he was an Oracle database administrator. An experienced consultant with a deep understanding of the Hadoop stack and its components, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs. André is a methodical and keen troubleshooter who loves making things run faster.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Amitai Armon is the chief data scientist for Intel’s Advanced Analytics group, which provides solutions for the company’s challenges in diverse domains ranging from design and manufacturing to sales and marketing, using machine learning and big data techniques. Previously, Amitai was the cofounder and director of research at TaKaDu, a provider of water-network analytics software to detect hidden underground leaks and network inefficiencies. The company received several international awards, including the World Economic Forum Technology Pioneers award. Amitai has about 15 years of experience in performing and leading data science work. He holds a PhD in computer science from Tel Aviv University in Israel, where he previously completed his BSc (cum laude, at the age of 18).

Presentations

Reducing neural-network training time through hyperparameter optimization Session

Neural-network models have a set of configuration hyperparameters tuned to optimize a given model's accuracy. Yahav Shadmi demonstrates how to select hyperparameters to significantly reduce training time while maintaining accuracy, presents examples for popular neural network models used for text and images, and describes a real-world optimization method for tuning.
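The session's own optimization method isn't reproduced here, but the baseline it improves on, random search over hyperparameters, can be sketched in a few lines. The loss function below is a toy stand-in for actually training a network and measuring validation loss:

```python
import random

# Toy stand-in for "train the network with these hyperparameters and
# report validation loss"; a real search would launch a training run here.
def validation_loss(lr, batch_size):
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e4

def random_search(trials=200, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        lr = 10 ** rng.uniform(-4, -1)         # learning rate on a log scale
        batch = rng.choice([16, 32, 64, 128])  # batch size from a small grid
        loss = validation_loss(lr, batch)
        if best is None or loss < best[0]:
            best = (loss, lr, batch)
    return best

loss, lr, batch = random_search()
```

Sampling the learning rate on a log scale matters in practice, since useful values span orders of magnitude.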

Carme Artigas is the founder and CEO of Synergic Partners, a strategic and technological consulting firm specializing in big data and data science (acquired by Telefónica in November 2015). She has more than 20 years of extensive expertise in the telecommunications and IT fields and has held several executive roles in both private companies and governmental institutions. Carme is a member of the Innovation Board of CEOE and the Industry Affiliate Partners at Columbia University’s Data Science Institute. An in-demand speaker on big data, she has given talks at several international forums, including Strata + Hadoop World, and collaborates as a professor in various master’s programs on new technologies, big data, and innovation. Carme was recently recognized as the only Spanish woman among the 30 most influential women in business by Insight Success. She holds an MS in chemical engineering and an MBA from Ramon Llull University in Barcelona and an executive degree in venture capital from UC Berkeley’s Haas School of Business.

Presentations

Executive Briefing: Analytics centers of excellence as a way to accelerate big data adoption by business Session

Big data technology is mature, but its adoption by business is slow, due in part to challenges like a lack of resources or the need for a cultural change. Carme Artigas explains why an analytics center of excellence (ACoE), whether internal or outsourced, is an effective way to accelerate the adoption and shares an approach to implementing an ACoE.

Doug Ashton is a senior data scientist at Mango Solutions, where he provides training and consultancy to a range of industries, from government to telecommunications and web retailers. Doug is a proponent of reproducible research and has spoken on such topics as reproducible environments and data analysis in teams.

Presentations

Spark and R with sparklyr Tutorial

R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Aimee Gott, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.

Sascha Askani is a senior systems engineer at inovex GmbH. Sascha has a strong storage and disaster recovery background and has helped various customers master their digital transformation challenges. He now focuses on solutions for his customers’ big data needs, with an emphasis on distributed storage solutions.

Presentations

Building containerized Spark on a solid foundation with Quobyte and Kubernetes Session

Multiple challenges arise if distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.

David Barber is reader in information processing in the Department of Computer Science at UCL, where he develops novel machine-learning algorithms. David is also a cofounder of the NLP company reinfer.io.

Presentations

Fast and effective training for deep learning Tutorial

David Barber considers two issues related to training of deep learning systems—natural language modeling and the use of higher-order optimization methods for deep learning—offering an overview of the topics, exploring recent work, and demonstrating how to use them effectively.

Denis Bauer leads the Transformational Bioinformatics team at Australia’s national science agency, the Commonwealth Scientific and Industrial Research Organisation (CSIRO)—the research institution behind fast WiFi, the Hendra virus vaccine, and polymer banknotes. She is also involved in initiatives to bring genomics into medical practice. Denis holds a PhD in bioinformatics with expertise in machine learning and genomics.

Presentations

How Apache Spark and AWS Lambda empower researchers to identify disease-causing mutations and engineer healthier genomes Tutorial

Denis C. Bauer explores how genomic research has leapfrogged to the forefront of big data and cloud solutions, outlines how to deal with “big” (many samples) and “wide” (many features per sample) data and how to keep runtime constant by using instantaneously scalable microservices with AWS Lambda, and contrasts Spark- and Lambda-based parallelization.

Daniel Bäurer is head of operations at inovex GmbH. Daniel has been designing and operating complex systems for over 15 years. He currently focuses on data center automation and Hadoop platforms.

Presentations

Building containerized Spark on a solid foundation with Quobyte and Kubernetes Session

Multiple challenges arise if distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.

Arturo Bayo is team leader and senior data engineer at Synergic Partners, where he specializes in banking and finance projects. He has broad knowledge of database administration (SQL, MongoDB, and Cassandra) and big data (Hadoop, R, Hive, and Spark). Arturo holds a bachelor of science degree in computer engineering from the Universidad Autonoma of Madrid and a bachelor of business administration (BBA) from UNED.

Presentations

Continuous analytics: Integrating the data hub in a DevOps pipeline Session

Arturo Bayo and Alvaro Fernandez Velando explain how a data hub strategy helps clarify data sharing and governance in an organization and share one way to implement a data hub architecture using big data technology and resources that are already established in the enterprise.

Hellmar Becker is a solutions engineer at Hortonworks, where he is helping spread the word about what you can do with data in the modern world. Hellmar has worked in a number of positions in big data analytics and digital analytics. Previously, he worked at ING Bank implementing the Datalake Foundation project (based on Hadoop) within client information management.

Presentations

Daddy, what color is that airplane overhead, and where is it going? Session

Hellmar Becker and Jorn Eilander explore the real-time collection and predictive analytics of flight radar data with IoT devices, NiFi, HBase, Spark, and Zeppelin.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences internationally and in the United States. He is the copresenter of various O’Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Ask me anything: Real-time data pipelines with Apache Kafka AMA

Join Tim Berglund to discuss topics from his tutorial, Real-time data pipelines with Apache Kafka, or ask any other questions you have.

Real-time data pipelines with Apache Kafka Tutorial

Tim Berglund demonstrates how to use Kafka Connect and Kafka Streams to build real-world, real-time streaming data pipelines—using Kafka Connect to ingest data from a relational database into Kafka topics as the data is being generated and then using Kafka Streams to process and enrich the data in real time before writing it out for further analysis.
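Kafka Streams itself is a Java API, but the enrich step described above amounts to a keyed join between a stream of events and a lookup table. As a rough, language-neutral sketch (the record shapes and names are invented for illustration):

```python
# Hypothetical enrich step: join each order event against a customers
# table keyed by customer_id, merging the fields into one record.
customers = {42: {"name": "Ada", "tier": "gold"}}

def enrich(order, lookup):
    # Missing keys fall through gracefully: the order passes unenriched.
    customer = lookup.get(order["customer_id"], {})
    return {**order, **customer}

event = {"order_id": 7, "customer_id": 42, "amount": 19.99}
enriched = enrich(event, customers)
```

In Kafka Streams this join runs continuously over topics rather than over in-memory dictionaries, with the lookup side materialized as a table.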

Wojciech Biela is the engineering manager for the Warsaw-based Teradata Center for Hadoop team (within Teradata Labs), which is devoted to open source Presto development. Previously, Wojciech helped build the Polish branch for Hadapt, an SQL-on-Hadoop startup from Boston, which was acquired by Teradata in 2014, and developed projects and led development teams across many industries, from large-scale search, ecommerce, and personal banking to POS systems. Wojciech graduated from the Wrocław University of Technology.

Presentations

Presto: Distributed SQL done faster Session

Wojciech Biela and Łukasz Osipiuk offer an introduction to Presto, an open source distributed analytical SQL engine that enables users to run interactive queries over their datasets stored in various data sources, and explore its applications in various big data problems.

Matt Brandwein is director of product management at Cloudera, driving the platform’s experience for data scientists and data engineers. Before that, Matt led Cloudera’s product marketing team, with roles spanning product, solution, and partner marketing. Previously, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in computer science and mathematics from the University of Massachusetts Amherst.

Presentations

Making self-service data science a reality Session

Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.

Mikio Braun is delivery lead for recommendation and search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in the industry.

Presentations

Deep learning in practice Session

Deep learning has become the go-to solution for challenges such as image classification or speech processing, but does it work for all application areas? Mikio Braun offers background on deep learning and shares his practical experience working with these exciting technologies.

Kay H. Brodersen is a data scientist at Google, where he works on Bayesian statistical models for causal inference in large-scale randomized experiments and anomaly detection in time series data. Kay studied at Muenster (Germany), Cambridge (UK), and Oxford (UK) and holds a PhD degree from ETH Zurich.

Presentations

Inferring the effect of an event using CausalImpact HDS

Causal relationships empower us to understand the consequences of our actions and decide what to do next. This is why identifying causal effects has been at the heart of data science. Kay Brodersen offers an introduction to CausalImpact, a new analysis library developed at Google for identifying the causal effect of an intervention on a metric over time.
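As a minimal illustration of the counterfactual idea behind CausalImpact (not the library's API, and with invented data): fit the pre-intervention relationship between the metric and an unaffected control series, predict the post-intervention period from that fit, and read the effect off the difference:

```python
# Ordinary least squares for y = a + b * x
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# control: a series unaffected by the intervention; metric: the one we watch
control = [10, 11, 12, 13, 14, 15, 16, 17]
metric  = [20, 22, 24, 26, 33, 35, 37, 39]  # intervention after index 3

a, b = fit_line(control[:4], metric[:4])     # fit on the pre-period only

# Counterfactual: what the metric would have been without the intervention
counterfactual = [a + b * x for x in control[4:]]
effect = [y - yhat for y, yhat in zip(metric[4:], counterfactual)]
avg_effect = sum(effect) / len(effect)       # lift attributable to the event
```

CausalImpact itself replaces the single regression with a Bayesian structural time-series model and reports uncertainty intervals, but the pre-fit/predict/difference logic is the same.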

Paul Brook is the EMEA big data analytics team lead at Dell EMC, where he leads a team of specialists working with customers, business partners, and technology integrators to deliver analytics and big data solutions that make money and save money for partners and customers. Previously, Paul was responsible for various programs within Dell, including hyperscale/cloud and high-performance computing. Prior to Dell, Paul held sales and product management roles within the applications development and managed services sectors and worked for a UK consultancy that specialized in business performance improvement.

Presentations

Another one bytes the dust (sponsored by Dell EMC) Keynote

Reliance only upon traditional data could lead to a catastrophic decision. Paul Brook explores the music industry to show how using modern information points, derived from technologies within the Hadoop framework, brings new data that changes the way a business or organization makes decisions.

Natalino Busa is the head of data science at Teradata, where he leads the definition, design, and implementation of big, fast data solutions for data-driven applications, such as predictive analytics, personalized marketing, and security event monitoring. Previously, Natalino served as enterprise data architect at ING and as senior researcher at Philips Research Laboratories on the topics of system-on-a-chip architectures, distributed computing, and parallelizing compilers. Natalino is an all-around technology manager, product developer, and innovator with a 15+ year track record in research, development, and management of distributed architectures and scalable services and applications.

Presentations

Classifying restaurant pictures: An API with Spark and Slider Session

Natalino Busa shares an implementation for classifying pictures based on Spark and Slider that was developed during the 2016 Yelp Restaurant Photo Classification challenge. Spark processes data and trains the ML model, which consists of deep learning and ensemble classification methods, while picture scoring is exposed via an API that is persisted and scaled with Slider.

Yishay Carmiel is the head of Spoken Labs, the strategic artificial intelligence and machine learning research arm of Spoken Communications. Spoken Labs develops and implements industry-leading deep learning and AI technologies for speech recognition (ASR), natural language processing (NLP), and advanced voice data extraction. Yishay and his team are currently working on bleeding-edge innovations that make the real-time customer experience a reality, at scale. Yishay has nearly 20 years’ experience as an algorithm scientist and technology leader building large-scale machine learning algorithms and serving as a deep learning expert.

Presentations

Conversation AI: From theory to the great promise Session

For years, people have been talking about the great promise of conversation AI. Recently, deep learning has taken us a few steps further toward achieving tangible goals, making a big impact on technologies like speech recognition and natural language processing. Yishay Carmiel offers an overview of the impact of deep learning, recent breakthroughs, and challenges for the future.

Michelle Casbon is director of data science at Qordoba. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on textual datasets. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her writing has been featured in the AI section of O’Reilly Radar. Michelle holds a master’s degree from the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation.

Presentations

Machine learning to automate localization with Apache Spark and other open source tools Session

Supporting multiple locales involves the maintenance and generation of localized strings. Michelle Casbon explains how machine learning and natural language processing are applied to the underserved domain of localization using primarily open source tools, including Scala, Apache Spark, Apache Cassandra, and Apache Kafka.

Haifeng Chen is a senior software architect at Intel’s Asia Pacific R&D Center. He has more than 12 years’ experience in software design and development, big data, and security, with a particular interest in image processing. Haifeng is the author of ColorStorm, an image browsing, editing, and processing application.

Presentations

Speed up big data encryption in Apache Hadoop and Spark Session

As the processing capability of modern platforms approaches memory speed, securing big data with encryption usually hurts performance. Haifeng Chen shares proven ways to speed up data encryption in Hadoop and Spark, as well as the latest progress in open source, and demystifies using hardware acceleration technology to protect your data.

Mandy Chessell is a master inventor, fellow of the Royal Academy of Engineering, and a distinguished engineer at IBM, where she is driving IBM’s strategic move to open metadata and governance through the Apache Atlas open source project. Mandy is a trusted advisor to executives from large organizations and works with them to develop strategy and architecture relating to the governance, integration, and management of information. You can find out more information on her blog.

Presentations

Building the metadata highway (sponsored by IBM) Keynote

Mandy Chessell explores the role that open source, embeddable, and interconnected metadata capabilities play in building a metadata highway.

Rumman Chowdhury is a senior manager and AI lead at Accenture, where she works on cutting-edge applications of artificial intelligence and leads the company’s responsible and ethical AI initiatives. She also serves on the board of directors for three AI startups. Rumman’s passion lies at the intersection of artificial intelligence and humanity. She comes to data science from a quantitative social science background. She has been interviewed by Software Engineering Daily, the PHDivas podcast, German Public Television, and fashion line MM LaFleur. In 2017, she gave talks at the Global Artificial Intelligence Conference, IIA Symposium, ODSC Masterclass, and the Digital Humanities and Digital Journalism conference, among others. Rumman holds two undergraduate degrees from MIT and a master’s degree in quantitative methods of the social sciences from Columbia University. She is nearing completion of her PhD at the University of California, San Diego.

Presentations

Mister P: Imputing granularity from your data Session

Multilevel regression and poststratification (MRP) is a method of estimating granular results from higher-level analyses. While it is generally used to estimate survey responses at a more granular level, MRP has clear applications in industry-level data science. Rumman Chowdhury reviews the methodology behind MRP and provides a hands-on programming tutorial.
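The poststratification half of MRP can be sketched with invented numbers: modeled per-cell estimates are reweighted by each cell's share of the target population to produce a granular (here, regional) estimate. In full MRP the cell estimates come from a multilevel regression, not raw cell averages:

```python
# Modeled support estimates per (age group, education) cell (assumed values)
cell_estimate = {
    ("18-34", "no degree"): 0.40,
    ("18-34", "degree"):    0.55,
    ("35+",   "no degree"): 0.30,
    ("35+",   "degree"):    0.50,
}

# Census counts for one region of interest (assumed values)
region_counts = {
    ("18-34", "no degree"): 200,
    ("18-34", "degree"):    100,
    ("35+",   "no degree"): 500,
    ("35+",   "degree"):    200,
}

# Weight each cell's estimate by its population share in the region
total = sum(region_counts.values())
regional_estimate = sum(
    cell_estimate[cell] * count / total
    for cell, count in region_counts.items()
)
```

The same cell estimates can be reweighted against any region's census counts, which is what makes the granular estimates cheap once the model is fit.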

Ira Cohen is cofounder and chief data scientist at Anodot, where he is responsible for developing and inventing its real-time multivariate anomaly detection algorithms, which work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Learning the relationships between time series metrics at scale; or, Why you can never find a taxi in the rain HDS

Identifying the relationships between time series metrics lets them be used for predictions, root cause diagnosis, and more. Ira Cohen shares accurate methods that work at large scale (e.g., behavioral pattern similarity clustering algorithms) and strategies for reducing false positives and false negatives, reducing computational resources, and distinguishing correlation from causation.
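As a rough illustration of the underlying idea (not the algorithms presented in the session), metrics can be grouped by behavioral similarity using pairwise correlation; the data below is synthetic and the function name is hypothetical:

```python
import numpy as np

def correlated_pairs(series, threshold=0.9):
    """Return index pairs of series whose Pearson correlation exceeds threshold."""
    corr = np.corrcoef(series)          # series: shape (n_metrics, n_samples)
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if corr[i, j] > threshold]

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
metrics = np.vstack([
    np.sin(t),                                     # metric 0
    np.sin(t) + 0.01 * rng.standard_normal(200),   # metric 1: noisy copy of 0
    np.cos(t),                                     # metric 2: unrelated phase
])
pairs = correlated_pairs(metrics)  # only metrics 0 and 1 are strongly related
```

Pairwise correlation is O(n²) in the number of metrics, which is exactly why the scalable clustering approaches the talk describes are needed for millions of signals.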

Darren Cook is a director at QQ Trend, a financial data analysis and data products company. Darren has over 20 years of experience as a software developer, data analyst, and technical director and has worked on everything from financial trading systems to NLP, data visualization tools, and PR websites for some of the world’s largest brands. He is skilled in a wide range of computer languages, including R, C++, PHP, JavaScript, and Python. Darren is the author of two books, Data Push Apps with HTML5 SSE and Practical Machine Learning with H2O, both from O’Reilly. The latter can help you take what you learn from this talk and start actually using machine-learning algorithms in your organization.

Presentations

Machine-learning algorithms: What they do and when to use them Tutorial

Darren Cook explores the main types of machine-learning algorithms, describing the kinds of task each is suited to; each algorithm’s explainability, repeatability, scalability, training time, sensitivity to data issues, and downsides; and the types of answers you can hope to get from them.

Eddie Copeland is director of government innovation at Nesta, an innovation foundation and think tank, where he leads work on government data, behavioral insights, digital public services, and digital democracy. Previously, Eddie was head of technology policy at Policy Exchange, one of the UK’s most influential think tanks. He is the author of five reports on government use of technology and data and a book on UK think tanks and is a regular writer and speaker on how government and public sector organizations can deliver more and better with less through smarter use of technology and data. He blogs at Eddiecopeland.me and tweets as @EddieACopeland.

Presentations

Lessons from piloting the London Office of Data Analytics Keynote

Eddie Copeland shares lessons learned from piloting the London Office of Data Analytics, a collaboration between the Greater London Authority, Nesta, and ASI Data Science that is exploring the potential of applying data analytics to reform public services.

Lidia Crespo is the chief data steward at Santander UK, where she leads the CDO team that oversees governance of Santander UK’s big data platform. She and her team have been instrumental in the adoption of the platform, building trust through their deep knowledge of the organization’s data. With experience in complex and challenging international projects and a background in audit, IT, and data, Lidia brings a combination of skills that is difficult to find.

Presentations

Data citizenship: The next stage of data governance Session

Successful organizations are becoming increasingly Agile, and the autonomy and empowerment that Agile brings create new, active modes of engagement. Data governance, however, is still very much a centralized task that only CDOs and data owners actively care about. Antonio Alvarez and Lidia Crespo outline a more engaging and active method of data governance: data citizenship.

Andy Crisp leads the EU and Asia Data Engineering team at Dun & Bradstreet, where his remit ensures that he keeps more than an eye on innovation. Andy started his career at Dun & Bradstreet in sales, which offered a way into the world of big data. He has since tirelessly led innovative and creative thinking, particularly in terms of how to build and improve the D&B global data asset. Andy was recognized by DataIQ as one of the 100 most influential people in data and in 2015 was on Information Age’s shortlist for the UK’s top 50 data leaders.

Presentations

Artificial intelligence in the enterprise Session

Martin Goodson gives a tell-all account of an ultimately successful installation of a deep learning system in an enterprise environment. Andy Crisp then shares insights into the challenges of integrating artificial intelligence systems into real-world business processes.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata + Hadoop World conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Making the future happen sooner

Business advantage used to come from scale. Today, it comes from adaptivity. In this session, Strata conference chair and Lean Analytics author Alistair Croll shares case studies from Tesla, Blockbuster, Waze, and others that demonstrate how companies can put data—and first-mover advantage—to work in competitive markets.

Thursday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Neil Cullum is a senior solutions engineer on the Innovation IT team at BMC, where he specializes in Control-M and workload automation in areas such as Hadoop, DevOps, and the cloud. Neil has over 22 years of experience in the IT industry, helping customers through technical and business challenges.

Presentations

Ingest, process, analyze: Automation and integration through the big data journey Session

Neil Cullum and Alon Lebenthal demonstrate how BMC can help automate every aspect of the big data journey with Control-M’s enterprise-grade automation capabilities and job-as-code approach, helping you deliver big data projects faster and better.

Shannon Cutt is the development editor in the data practice area at O’Reilly Media.

Presentations

Data 101 welcome Tutorial

Shannon Cutt welcomes you to Data 101.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Shree Dandekar is the vice president and general manager of the Honeywell Connected Plant, part of Honeywell Process Solutions, a leading advanced software and technology provider. Shree has deep experience in developing technology-based products and solutions to drive growth. He joined Honeywell after 17 years at Dell, where he served as general manager and executive director for its IoT, analytics and data science, and SaaS business. Shree also has experience in the hospitality, banking, retail, and healthcare industries and has been granted 41 software patents in the United States. He holds an MBA from the University of Texas, an MS in computer science from Texas State University, an MS in management information systems from Texas Tech University, and a BS in mechanical engineering from the University of Pune.

Presentations

The digital twin: Real and gaining ground Session

A digital twin is a virtual model of a product or service that allows analysis of data and monitoring of systems to avert problems before they occur and to plan for the future using simulations. Shree Dandekar explores a new cloud-based service from the Honeywell Connected Plant that provides industrial users with around-the-clock monitoring of plant data and rigorous simulations.

Pratim Das is a specialist solutions architect for big data and analytics at AWS for EMEA, where he advises customers on big data architecture, migration of big data workloads to the cloud, and best practices and guidelines for analytics, and works with customers on advanced analytics such as building predictive models, image recognition, NLP, and smart cities. Pratim also works closely with AWS product teams and actively participates in shaping the product roadmap for his customers. He is an expert in AWS big data services such as EMR, Redshift, Kinesis, Athena, and QuickSight and works closely with the open source ecosystem (including Cassandra, Elasticsearch, Hive, HBase, Spark, and R) as well as commercial solutions such as SAS, Teradata, and Tableau on AWS. He has almost 18 years of experience across industry verticals including local and central government, media, not-for-profits, security, management consultancies, and the cloud and has successfully delivered systems ranging from databases, data warehouses, and business intelligence to distributed applications and big data solutions.

Presentations

Building your first big data application on AWS Tutorial

Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ian Meyers, Pratim Das, and Ian Robinson walk you through building a big data application in real time using a combination of open source technologies, including Apache Hadoop, Spark, and Zeppelin, as well as AWS managed services such as Amazon EMR, Amazon Kinesis, and more.

Olivier de Garrigues is an EMEA solutions lead at Trifacta. Olivier has seven years’ experience in analytics. Previously, he was technical lead for business analytics at Splunk and a quantitative analyst at Accenture and Aon.

Presentations

Data wrangling for insurance FinData

Drawing on use cases from Trifacta customers, Olivier de Garrigues explains how to leverage data wrangling solutions in the insurance industry to streamline, strengthen, and improve data analytics initiatives on Hadoop.

Yves-Alexandre de Montjoye is a lecturer at Imperial College London, a research scientist at the MIT Media Lab, and a postdoctoral researcher at Harvard IQSS. His research aims to understand how the unicity of human behavior impacts the privacy of individuals—through reidentification or inference—in large-scale metadata datasets such as mobile phone, credit card, or browsing data. Previously, he was a researcher at the Santa Fe Institute in New Mexico, worked for the Boston Consulting Group, and acted as an expert for both the Bill and Melinda Gates Foundation and the United Nations. Yves-Alexandre was recently named an innovator under 35 for Belgium. His research has been published in Science and Nature Scientific Reports and has been covered by the BBC, CNN, the New York Times, the Wall Street Journal, Harvard Business Review, Le Monde, Der Spiegel, Die Zeit, and El País as well as in his TEDx talks. His work on the shortcomings of anonymization has appeared in reports of the World Economic Forum, United Nations, OECD, FTC, and the European Commission. He is a member of the OECD Advisory Group on Health Data Governance. Yves-Alexandre holds a PhD in computational privacy from MIT, an MSc in applied mathematics from Louvain, an MSc (centralien) from École Centrale Paris, an MSc in mathematical engineering from KU Leuven, and a BSc in engineering from Louvain.

Presentations

Computational privacy and the OPAL project: Using big personal data safely Session

Yves-Alexandre de Montjoye shows how metadata can work as a fingerprint to identify people in a large-scale metadata database even though no “private” information was ever collected, shares a formula that can be used to estimate the privacy of a dataset if you know its spatial and temporal resolution, and offers an overview of OPAL, a project that enables safe big data use using modern CS tools.

Emma Deraze is a data scientist with TES Global and a volunteer at DataKind UK.

Presentations

Open corporate ownership data Session

Emma Deraze explores a collaborative project between DataKind, Global Witness, and Open Corporates to analyze open UK corporate ownership data and presents findings and insights into the challenges facing open official data, particularly in an international setting with complex corporate networks.

Ding Ding is a software engineer on Intel’s Big Data Technology team, where she works on developing and optimizing distributed machine learning and deep learning algorithms on Apache Spark, focusing particularly on large-scale analytical applications and infrastructure on Spark.

Presentations

Distributed deep learning at scale on Apache Spark with BigDL HDS

Built on Apache Spark, BigDL provides deep learning functionality parity with existing DL frameworks—with better performance. Ding Ding explains how BigDL helps make the big data platform a unified data analytics platform, enabling more accessible deep learning for big data users and data scientists.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions, saving millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Big data governance for the hybrid cloud: Best practices and how-to Session

Big data needs governance—not just for compliance but also for data scientists. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start, especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Vikas Singh share a step-by-step approach to kickstart your big data governance.

Ted Dunning has been involved with a number of startups—the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member of the Apache ZooKeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data mining and passionate about open source, he is an active participant in Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

Meet the Expert with Ted Dunning (MapR Technologies) Meet the Experts

Ted will talk about streaming architecture, machine learning, and geodistributed data.

Tensor abuse in the workplace Session

Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).

Yuval Dvir is head of online partnerships at Google Cloud, where he helps organizations change and transform by adopting Lean, Agile, and modern ways of working powered by Google’s Cloud Platform and G Suite infrastructure and productivity suite. Previously, he led product strategy and operations across search ads, display, programmatic, YouTube, and shopping. Yuval is a digital transformation executive with 15 years’ experience combining deep product knowledge, rich data insights, and strategic operations know-how to lead change, innovation, and growth in global organizations. As Microsoft’s global head of business transformation, he rebuilt Skype’s data infrastructure and visualization layer, later managed under a newly designed Global Insights team. The transformation effort created a modern digital ecosystem, hardwiring product, engineering, and business functions to it across all levels and making it the de facto operating model of Skype. As Skype’s lead for product strategy, he radically accelerated the shift to mobile and cloud, streamlined the user experience for a similar look and feel across all platforms, drove the migration of hundreds of millions of Skype and MSN Messenger customers onto a single network under the Skype brand, and co-led the merger with Lync to become a unified consumer and enterprise global business. Yuval is an industry speaker, evangelist, and thought leader on developing and leading high-performance teams, divisions, and companies using analytics, culture, and agility as the main pillars. He holds a BSc from the Technion, Israel’s Institute of Technology, and an MBA from INSEAD Business School in France and Singapore.

Presentations

A wealth of information leads to a poverty of attention: Why adopting the cloud can help you stay focused on the right things Session

In an era when we are bombarded with data and tasks to finish, our ability to focus our attention becomes critical. When 70% of our code is for DevOps purposes and 90% of our data is dark, the cloud is a welcome, secure, and efficient relief. Yuval Dvir refutes common misconceptions about the cloud and explains why it's not a matter of "if" but "when" you'll move to the cloud.

Jorn Eilander is a Hadoop DevOps engineer at ING. Jorn has extensive experience working with Hadoop in high-risk enterprise environments as a data ingestion expert and Hadoop system engineer. An IoT and home automation enthusiast, he has worked with several Raspberry Pi and Arduino platforms to gather data for his Hadoop cluster.

Presentations

Daddy, what color is that airplane overhead, and where is it going? Session

Hellmar Becker and Jorn Eilander explore the real-time collection and predictive analytics of flight radar data with IoT devices, NiFi, HBase, Spark, and Zeppelin.

Alon Elishkov is a software engineer and team leader at Outbrain, where he leads the Data Infrastructure team. With over 13 years’ experience in the industry and a true passion for scaling solutions, Alon focuses on building self-healing large-scale distributed data delivery and processing infrastructure products and deploying them in demanding production environments.

Presentations

Migrating petabyte-scale Hadoop clusters with zero downtime Session

Migrating petabyte-scale Hadoop installations to a new cluster with hundreds of machines, several thousand jobs daily, and countless ecosystem integrations while maintaining a stable production environment is a challenging task. Alon Elishkov discusses the techniques and tools Outbrain has developed to achieve this goal.

Wael Elrifai is an avid technologist and management strategist bridging the divide between IT and business as Pentaho’s EMEA director of enterprise solutions. Wael is a member of the Association for Computing Machinery, the Special Interest Group for Artificial Intelligence, the Royal Economic Society, and Chatham House. He holds graduate degrees in both electrical engineering and economics.

Presentations

Big data science, the IoT, and the transportation sector Tutorial

Wael Elrifai leads a journey through the design and implementation of a predictive maintenance platform for Hitachi Rail. The industrial internet, the IoT, data science, and big data make for an exciting ride.

Alvaro Fernandez Velando is a chemical engineer with postgraduate training in data science, artificial intelligence, and robotics. He has 20 years of experience in the financial sector at several leading banks (BBVA, HSBC, La Caixa, and Santander). He joined Santander as CRM director in 2007 and was recently named chief risk data officer, with responsibility for methodology and modeling, big data in risk, and risk management information.

Presentations

Continuous analytics: Integrating the data hub in a DevOps pipeline Session

Arturo Bayo and Alvaro Fernandez Velando explain how a data hub strategy helps clarify data sharing and governance in an organization and share one way to implement a data hub architecture using big data technology and resources that are already established in the enterprise.

Eugene Fratkin is a director of engineering at Cloudera leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Eugene Fratkin, Philip Langdale, David Tishgart, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Chris Fregly is a research scientist at PipelineIO, a San Francisco-based streaming machine learning and artificial intelligence startup. Previously, Chris was a distributed systems engineer at Netflix, a data solutions engineer at Databricks, and a founding member of the IBM Spark Technology Center in San Francisco. Chris is a regular speaker at conferences and meetups throughout the world. He’s also an Apache Spark contributor, a Netflix Open Source committer, founder of the Global Advanced Spark and TensorFlow meetup, author of the upcoming book Advanced Spark, and creator of the upcoming O’Reilly video series Deploying and Scaling Distributed TensorFlow in Production.

Presentations

Deploy Spark ML TensorFlow AI models from notebooks to hybrid clouds (including GPUs) Session

Chris Fregly explores an often-overlooked area of machine learning and artificial intelligence—the real-time, end-user-facing “serving” layer in hybrid-cloud and on-premises deployment environments—and shares a production-ready environment to serve your notebook-based Spark ML and TensorFlow AI models with highly scalable and highly available robustness.

Ellen Friedman is a solutions consultant, scientist, and O’Reilly author currently writing about a variety of open source and big data topics. Ellen is a committer on the Apache Drill and Mahout projects. With a PhD in biochemistry and years of work writing on a variety of scientific and computing topics, she is an experienced communicator. Ellen is coauthor of Streaming Architecture, the Practical Machine Learning series from O’Reilly, Time Series Databases, and her newest title, Introduction to Apache Flink. She’s also coauthor of a book of magic-themed cartoons, A Rabbit Under the Hat. Ellen has been an invited speaker at Strata + Hadoop in London, Berlin Buzzwords, the University of Sheffield Methods Institute, and the Philly ETE conference and a keynote speaker for NoSQL Matters 2014 in Barcelona.

Presentations

Making a change: Digital transformation and organizational culture Data 101

Big data and emerging technologies offer powerful benefits, but for an organization to use them to their full advantage, a change in organizational culture is required. Ellen Friedman offers practical guidance on how to adopt an organizational culture that supports digital transformation, using examples from a variety of business use cases.

Laura Frolich is a data scientist at Think Big, where she is dedicated to utilizing data to discover patterns and underlying structure to enable optimization of businesses and processes. Previously, Laura was part of a research group investigating nonspecific effects of vaccines, using survival analysis methods. Laura holds a PhD from the Technical University of Denmark. For her thesis, “Decomposition and Classification of Electroencephalography Data,” Laura used existing unsupervised methods and supervised classification techniques to understand brain activity through recordings of EEG and developed rigorous, interpretable classification methods for multidimensional (tensor) data.

Presentations

Enterprise artificial intelligence Session

Laura Frolich explores applications of deep learning in companies—looking at practical examples of assessing the opportunity for AI, phased adoption, and lessons going from research to prototype to scaled production deployment—and discusses the future of enterprise AI.

Maosong Fu is the technical lead for Heron and real-time analytics at Twitter and the author of a few publications in distributed systems. Maosong holds a master’s degree from Carnegie Mellon University and a bachelor’s degree from Huazhong University of Science and Technology.

Presentations

Speeding up Twitter Heron streaming by 5x Session

Twitter processes billions of events per day at the instant the data is generated. To achieve real-time performance, Twitter employs Heron, an open source streaming engine tailored for large-scale environments. Sanjeev Kulkarni and Maosong Fu share several optimizations implemented in Heron to improve throughput by 5x and reduce latency by 50–60%.

Yupeng Fu is a software engineer at Alluxio and a PMC member of the Alluxio open source project. Previously, Yupeng worked at Palantir, where he led the efforts building the company’s storage solution. Yupeng holds a BS and an MS from Tsinghua University and has completed coursework toward a PhD at UCSD.

Presentations

Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x performance improvement to Qunar’s streaming processing Session

Alluxio—the first memory-speed virtual distributed storage system in the world—unifies data from various under storage systems and presents a global namespace to various computation frameworks. Xueyan Li and Yupeng Fu explore how Alluxio has delivered performance improvements averaging 300x at service peak time on stream processing workloads at Qunar.

Barbara Fusinska is a data solution architect with a strong software development background and experience building diverse software systems for a variety of companies. She believes in the importance of data and metrics when growing a successful business. Besides collaborating on data architectures, Barbara also enjoys programming. Based in London, she speaks regularly at conferences. Barbara tweets at @BasiaFusinska and blogs at Barbarafusinska.com.

Presentations

Deep learning with Microsoft Cognitive Toolkit Session

The popularity of deep learning is due in part to its capabilities in recognizing patterns from inputs such as images or sounds. Barbara Fusinska offers an overview of Microsoft Cognitive Toolkit, an open source framework offering various modules and algorithms that enable machines to learn like a human brain.

Eddie Garcia is chief information security officer at Cloudera, a leader in enterprise analytic data management, where he draws on his more than 20 years of information and data security experience to help Cloudera Enterprise customers reduce security and compliance risks associated with sensitive datasets stored and accessed in Apache Hadoop environments. Previously, Eddie was the vice president of infosec and engineering for Gazzang prior to its acquisition by Cloudera, where he architected and implemented secure and compliant big data infrastructures for customers in the financial services, healthcare, and public sector industries to meet PCI, HIPAA, FERPA, FISMA, and EU data security requirements. He was also the chief architect of the Gazzang zNcrypt product and is author of three patents for data security.

Presentations

Machine learning to "spot" cybersecurity incidents at scale Session

The use of big data and machine learning to detect and predict security threats is a growing trend, with interest from financial institutions, telecommunications providers, healthcare companies, and governments alike. Eddie Garcia explores how companies are using Apache Hadoop-based approaches to protect their organizations and explains how Apache Spot is tackling this challenge head-on.

Bas Geerdink is a programmer, scientist, and IT manager at ING, where he is responsible for the fast data systems that process and analyze streaming data. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Fast data at ING: Utilizing Kafka, Spark, Flink, and Cassandra for data science and streaming analytics Session

As a data-driven enterprise, ING is heavily investing in big data, analytics, and stream processing. Bas Geerdink shares three use cases at ING and discusses their respective architectures and technology. All software is currently in production, running with modern tools such as Kafka, Cassandra, Spark, Flink, and H2O.ai.

Aurélien Géron is a machine-learning consultant. Previously, he led YouTube’s video classification team and was founder and CTO of two successful companies (a telco operator and a strategy firm). Aurélien is the author of several technical books, including the O’Reilly book Hands-on Machine Learning with Scikit-Learn and TensorFlow.

Presentations

How knowledge graphs can help dramatically improve recommendations Session

Collaborative filtering is great for recommendations, yet it suffers from the cold-start problem. New content with no views is ignored, and new users get poor recommendations. Aurélien Géron shares a solution: knowledge graphs. With a knowledge graph, you can truly understand your users' interests and make better, more relevant recommendations.
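A toy sketch (all data and names below are hypothetical, not from the talk) of why a knowledge graph helps with cold start: items link to entities such as topics, so a brand-new item with zero views can still be matched to a user's known interests:

```python
# Items are linked to knowledge-graph entities (topics); users have interest
# profiles over the same entities. A new item needs no view history to score.
item_entities = {
    "new_video": {"machine_learning", "python"},   # brand new, no views yet
    "old_video": {"cooking"},
}
user_interests = {"alice": {"python", "statistics"}}

def recommend(user):
    """Rank items by how many entities they share with the user's interests."""
    scores = {item: len(entities & user_interests[user])
              for item, entities in item_entities.items()}
    return max(scores, key=scores.get)

print(recommend("alice"))  # prints "new_video" despite it having no views
```

Collaborative filtering would have no signal at all for `new_video`; the entity overlap provides one from day one, and in practice would be combined with collaborative signals once views accumulate.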

Colin Gillespie is a senior lecturer at Newcastle University, UK, where he works on high-performance statistical computing and Bayesian statistics. Colin is also lead consultant at Jumping Rivers. He has been teaching R since 2005 at levels ranging from beginner to advanced programming. Colin is author of the upcoming O’Reilly book Efficient R Programming.

Presentations

Efficient R programming Session

R has the reputation for being slow. Colin Gillespie covers key ideas and techniques for making your R code as efficient as possible, from R setup to common R coding problems to linking R with C++ for an extra speed boost.

Meet the Expert with Colin Gillespie (Jumping Rivers | Newcastle University) Meet the Experts

Colin is the author of Efficient R Programming, so if you have questions about working with R, stop by and see him.

Anthony Goldbloom is cofounder and CEO of Kaggle. In 2011 and 2012, Forbes magazine named Anthony one of the 30 under 30 in technology, and in 2013, MIT Tech Review named him one of the top 35 innovators under the age of 35. He was given the alumni of distinction award by the University of Melbourne, where he holds a first-class honors degree in econometrics. Anthony has been published in the Economist and Harvard Business Review.

Presentations

Meet the Expert with Anthony Goldbloom (Kaggle) Meet the Experts

What can you learn from nearly one million data scientists building more than two million machine-learning models? Stop by to meet Anthony and find out.

What Kaggle has learned from almost a million data scientists Keynote

Kaggle is a community of almost a million data scientists, who have built more than two million machine-learning models while participating in Kaggle competitions. Data scientists come to Kaggle to learn, collaborate, and develop the state of the art in machine learning. Anthony Goldbloom shares lessons learned from top performers in the Kaggle community.

Miguel González-Fierro is a data scientist at Microsoft UK, where his job consists of helping customers improve their processes using big data and machine learning. Previously, he was CEO and founder of Samsamia Technologies, a company that created a visual search engine for fashion items allowing users to find products using images instead of words, and founder of the Robotics Society of Universidad Carlos III, which developed different projects related to UAVs, mobile robots, small humanoid competitions, and 3D printers. Miguel also worked as a robotics scientist at Universidad Carlos III of Madrid and King’s College London, where his research focused on learning from demonstration, reinforcement learning, computer vision, and dynamic control of humanoid robots. He holds a BSc and MSc in electrical engineering and an MSc and PhD in robotics.

Presentations

Mastering computer vision problems with state-of-the art deep learning architectures, MXNet, and GPU virtual machines Session

Deep learning is one of the most exciting techniques in machine learning. Miguel González-Fierro explores the problem of image classification using ResNet, the deep neural network that surpassed human-level accuracy for the first time, and demonstrates how to create an end-to-end process to operationalize deep learning in computer vision for business problems using Microsoft R Server and GPU VMs.

Speeding up machine-learning applications with the LightGBM library in real-time domains HDS

The speed of a machine-learning algorithm can be crucial in problems that require retraining in real time. Mathew Salvaris and Miguel González-Fierro introduce Microsoft's recently open sourced LightGBM library for decision trees, which outperforms other libraries in both speed and accuracy, and demo several applications using LightGBM.
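For readers new to the technique LightGBM accelerates, here is a minimal pure-Python sketch of gradient boosting with decision stumps on 1-D data. It is illustrative only; LightGBM adds histogram binning, leaf-wise tree growth, and many other optimizations that make retraining fast enough for real-time use.

```python
def fit_stump(xs, residuals):
    """Find the single split (tried at midpoints) that best reduces squared error."""
    best = None
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=20, lr=0.5):
    """Fit stumps sequentially, each one correcting the previous residuals."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Toy step function: the ensemble converges toward 0 on the left, 1 on the right.
model = boost([0, 1, 2, 3], [0.0, 0.0, 1.0, 1.0])
```

Each round fits a stump to the current residuals, so the ensemble's error shrinks geometrically on this toy data; production libraries apply the same loop to millions of rows and hundreds of features.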

Martin Goodson is the chief scientist and CEO of Evolution AI, where he specializes in large-scale statistical computing and natural language processing. Martin has designed data science products that are in use at companies like Time Inc., Hearst, John Lewis, Condé Nast, and Buzzfeed. Previously, Martin worked as a statistician at the University of Oxford, where he conducted research on statistical matching problems for DNA sequences.

Presentations

10 ways your data project is going to fail and how to prevent it Tutorial

Data science continues to generate excitement, and yet real-world results can often disappoint business stakeholders. Martin Goodson offers a personal perspective on the most common failure modes of data science projects and discusses current best practices.

Artificial intelligence in the enterprise Session

Martin Goodson gives a tell-all account of an ultimately successful installation of a deep learning system in an enterprise environment. Andy Crisp then shares insights into the challenges of integrating artificial intelligence systems into real-world business processes.

Martin Görner works in developer relations at Google. Martin is passionate about science, technology, coding, algorithms, and everything in between. Previously, he worked in the computer architecture group of STMicroelectronics and spent 11 years shaping the nascent ebook market, starting at Mobipocket, a startup that later became the software part of the Amazon Kindle and its mobile variants. He graduated from Mines ParisTech.

Presentations

TensorFlow and deep learning (without a PhD) Session

With TensorFlow, deep machine learning has transitioned from an area of research into mainstream software engineering. Martin Görner walks you through building and training a neural network that recognizes handwritten digits with >99% accuracy using Python and TensorFlow.
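The final layer of a digit classifier like the one the session builds is a softmax over class scores, trained with cross-entropy loss. The sketch below shows those two pieces in pure Python (the session itself uses TensorFlow; the numbers here are made-up scores for three classes):

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Penalty for assigning low probability to the true class."""
    return -math.log(probs[label])

probs = softmax([2.0, 1.0, 0.1])      # hypothetical scores for 3 classes
loss = cross_entropy(probs, label=0)  # true class is class 0
```

Training a network means adjusting its weights to push this loss down across many labeled examples; TensorFlow automates the gradient computation that makes that feasible for a full ten-class digit model.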

As training lead at Mango, Aimee Gott has delivered over 200 days of training, including onsite training courses in Europe and the US in all aspects of R as well as shorter workshops and online webinars. Aimee oversees Mango’s training course development across the data science pipeline and regularly attends R user groups and meetups. Aimee is also a coauthor of Sams Teach Yourself R in 24 Hours. Aimee holds a PhD in statistics from Lancaster University.

Presentations

Spark and R with sparklyr Tutorial

R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Aimee Gott, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.

Trent Gray-Donald is a distinguished engineer in IBM Analytics’s Analytic Platform Services organization, where he works on analytics services for IBM Bluemix, including IBM BigInsights and Apache Spark. Previously, Trent worked on high-speed in-memory analytics solutions, such as Cognos BI and DB2 BLU. He was a member of the IBM Java Technology Centre and was overall technical lead on the IBM Java 7 project. Trent holds a bachelor of mathematics in computer science from the University of Waterloo, Canada.

Presentations

Hadoop and object stores: Can we do it better? Session

Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry, and he has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on big data topics at national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover, Ted Malaska, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

What no one tells you about writing a streaming app Session

Any nontrivial streaming app requires that you consider a number of important topics, but questions like how to manage offsets or state often go unanswered. Mark Grover and Ted Malaska share practices that no one talks about when you start writing a streaming app but that you'll inevitably need to learn along the way.
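Offset management, one of the questions the session tackles, comes down to when you commit. The plain-Python sketch below (no Kafka client; the names are hypothetical stand-ins) shows the at-least-once pattern: process the whole batch first, then commit, so a crash causes reprocessing rather than data loss.

```python
committed = {"partition-0": 0}   # stand-in for a durable offset store
processed = []

def poll(offset, n=3):
    """Stand-in for fetching n records starting at a given offset."""
    return [(offset + i, "event-%d" % (offset + i)) for i in range(n)]

def run_once(partition):
    offset = committed[partition]
    batch = poll(offset)
    for _, event in batch:                       # 1. process the entire batch
        processed.append(event)
    committed[partition] = offset + len(batch)   # 2. only then commit the offset

run_once("partition-0")
```

Committing before processing would instead give at-most-once semantics (a crash mid-batch loses records); exactly-once requires making the processing and the commit atomic, which is where much of the real-world complexity lives.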

Adam Grzywaczewski is a deep learning solution architect at NVIDIA, where his primary responsibility is to support a wide range of customers in delivery of their deep learning solutions. Adam is an applied research scientist specializing in machine learning with a background in deep learning and system architecture. Previously, he was responsible for building up the UK government’s machine-learning capabilities while at Capgemini and worked in the Jaguar Land Rover Research Centre, where he was responsible for a variety of internal and external projects and contributed to the self-learning car portfolio.

Presentations

Deep learning: Assessing analytics project feasibility and computational requirements Session

Adam Grzywaczewski offers an overview of the types of analytical problems that can be solved using AI and shares a set of heuristics that can be used to evaluate the feasibility of analytical AI projects. Adam then covers the computational requirements for the deep learning training process, leaving you with the key tools you need to initiate an analytical AI project.

Luke Han is the cofounder and CEO of Kyligence as well as the cocreator and PMC chair of Apache Kylin, where he drives the project’s strategy, roadmap, and product design and works on growing Apache Kylin’s community, building its ecosystem, and extending adoptions. Previously, Luke was big data product lead at eBay, where he managed Apache Kylin, engaging customers and coordinating various teams from different geographies, and chief consultant at Actuate China.

Presentations

Apache Kylin use cases in China Session

Apache Kylin is rapidly being adopted around the world—especially in China. Luke Han explores how various industries use Apache Kylin, sharing why these companies choose Apache Kylin (a technology comparison), how they use Apache Kylin (their production deployment pattern), and most importantly, the resulting business impact.

Meet the Expert with Luke Han (Kyligence) Meet the Experts

Luke is cocreator and vice president of Apache Kylin, the top open source OLAP engine on Hadoop. Come chat with Luke about Apache Kylin, OLAP on Hadoop, and interesting use cases.

Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic-net regularization in Spark’s ML library and one-pass elastic-net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine-learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.

Presentations

Building a scalable recommendation engine with Spark and Elasticsearch Session

There are many resources available for learning how to use Spark to build collaborative filtering models. However, there are relatively few that explain how to build a large-scale, end-to-end recommender system. Seth Hendrickson demonstrates how to create such a system using Spark Streaming, Spark ML, and Elasticsearch.
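The plumbing in the session is Spark and Elasticsearch, but the modeling core is collaborative filtering. As a toy illustration (not the session's code), item-based similarity can be computed from who interacted with what: two items are similar when largely the same users touched both.

```python
import math

# Hypothetical interaction matrix: user -> items interacted with (1 = interaction).
interactions = {
    "alice": {"item1": 1, "item2": 1},
    "bob":   {"item1": 1, "item3": 1},
    "carol": {"item2": 1, "item3": 1},
}

def item_vector(item):
    """The item's column of the user-item matrix, in a fixed user order."""
    return [interactions[u].get(item, 0) for u in sorted(interactions)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# item1 and item2 share exactly one of two interacting users each.
sim = cosine(item_vector("item1"), item_vector("item2"))
```

At production scale the same similarities are computed distributedly (e.g., with Spark ML's ALS or item-similarity jobs) and then indexed in a serving store such as Elasticsearch for low-latency lookup.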

Spark Structured Streaming for machine learning Session

Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.

Nicolaus Henke is a senior partner at McKinsey & Company, where he advises leading companies on how to improve decision making and performance through advanced analytics, artificial intelligence, and end-to-end data-enabled transformations. He is the global leader of McKinsey Analytics, a practice of more than 2,000 dedicated analytics practitioners and translators, where he oversees partnerships between McKinsey and the wider artificial intelligence, data, and computing ecosystem. Nicolaus is the chairman of QuantumBlack (acquired by McKinsey in 2015), a company operating at the intersection of strategy, technology, and design, and a member of McKinsey’s global board, the Shareholders Council. He helped found and is a member of the Board of Innovative Healthcare Delivery at Duke Medicine and serves on the Dean’s Advisory Council at Harvard’s Kennedy School. Nicolaus frequently speaks on big data, analytics, and healthcare topics at global forums such as the World Economic Forum, the Milken Institute, and Forbes. He holds a master’s degree and PhD in business from the University of Münster, Germany, and a master’s in public administration from Harvard University, where he was a John J. McCloy Scholar.

Presentations

Executive Briefing: What CEOs think about AI and how to drive adoption across the enterprise Session

Nicolaus Henke explores what CEOs currently think about AI and explains how to drive adoption successfully across the enterprise.

Grace Huang is the data science lead for discovery at Pinterest, where discovery products like recommendations and personalization are developed. She is passionate about building data science products around machine-learning algorithms to drive a better experience for Pinterest users and build a sustainable ecosystem.

Presentations

Peeking into the black box: Lessons from the front lines of machine-learning product launches Keynote

Grace Huang shares lessons learned from running and interpreting machine-learning experiments and outlines launch considerations that enable sustainable, long-term ecosystem health.

The mystery of the vanishing pins: Building a sustainable content ecosystem at Pinterest Session

Grace Huang offers a glimpse into the unique challenges of maintaining a healthy ecosystem around machine-learning products at Pinterest. Grace explores the suite of tools Pinterest built to make sense of machine-learning experiment results and the panel of metrics it developed to help gauge the health of the content ecosystem.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and currently spends much of his time writing the book Stream Processing with Apache Flink.

Presentations

Meet the Expert with Fabian Hueske (data Artisans) Meet the Experts

Fabian will discuss SQL on streams, stream analytics use cases, and stream processing with Apache Flink.

Stream analytics with SQL on Apache Flink Session

Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage.
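To make the idea concrete, a streaming query in Flink SQL looks much like batch SQL, with window functions supplying the streaming semantics. The sketch below is a hypothetical example (table and column names are invented) of counting clicks per user in hourly tumbling windows:

```sql
-- Hypothetical Clicks table with an event-time attribute.
SELECT
  user_id,
  TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
  COUNT(*) AS clicks
FROM Clicks
GROUP BY
  user_id,
  TUMBLE(event_time, INTERVAL '1' HOUR)
```

The `TUMBLE` group window function is what adapts SQL's set-based semantics to an unbounded stream: instead of one final result, the query continuously emits one row per user per completed hour.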

Ali Hürriyetoglu is a data scientist at Statistics Netherlands and a PhD candidate at Radboud University, where his research focuses on information extraction and social media analysis. He has a background in computer science. He has worked for a number of organizations, including EU JRC and Appen, in the language technologies area and recently completed a five-month traineeship at Netbase Solutions Inc. in Santa Clara, CA, where he focused on Turkish morphological analysis and sentiment analysis. Born to a family with Arabic origins in Turkey, Ali is fluent in Arabic, Turkish, English, German, Italian, and Dutch. He holds an undergraduate degree in computer engineering from Ege University in Izmir, Turkey, and a master’s degree in cognitive science from Middle East Technical University in Ankara, Turkey. During his undergraduate studies, he was an exchange student at Technische Hochschule Mittelhessen in Giessen, Germany.

Presentations

Relevancer: Finding and labeling relevant information in tweet collections Session

Identifying relevant tweets in tweet collections that are gathered via key words is a huge challenge. Ali Hürriyetoglu and Nelleke Oostdijk share the results of a study on using unsupervised and supervised machine learning with linguistic insight to enable people to identify relevant tweets for their needs and offer an overview of their tool, Relevancer.

Mads Ingwar is the client services director at Think Big, a Teradata company, where he is responsible for leading consulting teams delivering data science and big data analytics combining Hadoop and Spark, the public cloud, and traditional data warehousing. Mads has a proven track record in using data science and big data to bring measurable added value to everything from startups to Fortune 500 companies. Mads has a strong background in pervasive computing and sensor-based technologies and holds a PhD from the Technical University of Denmark, where his research focused on machine learning and data analysis.

Presentations

Driving business value: Predicting piston ring failures in massive vessels Session

Eliano Marques and Mads Ingwar share a case study on how to leverage data science to plan ship engine maintenance by warning about potential piston ring failure.

Jeroen Janssens is the founder of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. One day a week, Jeroen is also an assistant professor at Jheronimus Academy of Data Science. Previously, he was a data scientist at Elsevier in Amsterdam and at the startups YPlan and Outbrain in New York City. He is the author of Data Science at the Command Line, published by O’Reilly. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Presentations

Create interactive maps in seconds with R and Leaflet Session

Leaflet, one of the most popular open source JavaScript libraries for interactive maps, is used by websites ranging from the New York Times and the Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB. Jeroen Janssens explains how the Leaflet R package makes it easy to integrate and control Leaflet maps in R.

Rekha Joshi is a principal engineer in Intuit’s Technology group, where she is responsible for designing and implementing large-scale intelligent distributed platform solutions. Previously, she delivered large-scale personalized solutions for internet scale at Yahoo. Rekha has worked in diverse domains of finance, advertising, supply chain, and AI research. Her refueling stops include reading Isaac Asimov, Richard Feynman, and PG Wodehouse and stalking Elon Musk.

Presentations

Performance and security: A tale of two cities Session

Performance and security are often at loggerheads. Rekha Joshi explains why and offers a deep dive into how performance and security are managed in some of the most intense and critical data platform services at Intuit.

Ismael Juma is a Kafka committer and engineer at Confluent, where he is building a stream data platform based on Apache Kafka. Earlier, he worked on automated data balancing. Previously, Ismael was the lead architect at Time Out, where he was responsible for the data platform at the core of Time Out’s international expansion and print to digital transition. Ismael has contributed to several open source projects, including Voldemort and Scala.

Presentations

Elastic streams: Dynamic data redistribution in Apache Kafka Session

Dynamic data rebalancing is a complex process. Ben Stopford and Ismael Juma explain how to do data rebalancing and use replication quotas in the latest version of Apache Kafka.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Driving the next wave of data lineage with automation, visualization, and interaction Session

Sean Kandel offers an overview of an entirely new approach to visualizing metadata and data lineage, explaining how to track how different attributes of data are derived during the data preparation process and the associated linkages across different elements in the data.

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. At narrativeVIZ Consulting, Amit uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Amit also teaches storytelling with data as guest faculty in executive courses at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 12 years of management consulting experience with AT Kearney in India, Booz & Company in Europe, and more recently with startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT Delhi and a PGDM (MBA) from IIM Ahmedabad. Find out more about him at amitkaps.com.

Presentations

Interactive data visualizations using Visdown Tutorial

Crafting interactive data visualizations for the web is hard—you're stuck using proprietary tools or must become proficient in JavaScript libraries like D3. But what if creating a visualization were as easy as writing text? Amit Kapoor and Bargava Subramanian outline the grammar of interactive graphics and explain how to use Visdown, a declarative markdown-based tool, to build them with ease.

Meet the Expert with Amit Kapoor (narrativeVIZ Consulting) and Bargava Subramanian (Red Hat) Meet the Experts

Stop by and talk with Amit and Bargava about any aspect of narrative and interactive visualization or model (ML) visualization.

Holden Karau is a transgender Canadian, an Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a bachelor of mathematics in computer science.

Presentations

Debugging Apache Spark Session

Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau explores how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.

Spark Structured Streaming for machine learning Session

Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.

Mubashir Kazia is a solutions architect at Cloudera focusing on security. Mubashir started the initiative to integrate Cloudera Manager with Active Directory for Kerberizing clusters and provided sample code. Mubashir has also contributed patches to Apache Hive that fixed security-related issues.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Presentations

Creating real-time, data-centric applications with Impala and Kudu Session

Marcel Kornacker offers an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL on Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Aljoscha Krettek is a PMC member at Apache Beam and Apache Flink, where he mainly works on the Streaming API and also designed and implemented the most recent additions to the Windowing and State APIs. Aljoscha is a cofounder and software engineer at data Artisans. Previously, he worked at IBM Germany and at the IBM Almaden Research Center in San Jose. He studied computer science at TU Berlin.

Presentations

Unified stateful big data processing in Apache Beam (incubating) Session

Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Aljoscha Krettek introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.

Sanjeev Kulkarni is the cofounder of Streamlio, a company focused on building a next-generation real-time stack. Previously, he was the technical lead for real-time analytics at Twitter, where he cocreated Twitter Heron; worked at Locomatix handling the company’s engineering stack; and led several initiatives for the AdSense team at Google. Sanjeev holds an MS in computer science from the University of Wisconsin, Madison.

Presentations

Speeding up Twitter Heron streaming by 5x Session

Twitter processes billions of events per day at the instant the data is generated. To achieve real-time performance, Twitter employs Heron, an open source streaming engine tailored for large-scale environments. Sanjeev Kulkarni and Maosong Fu share several optimizations implemented in Heron to improve throughput by 5x and reduce latency by 50–60%.

Scott Kurth is the vice president of advisory services at Silicon Valley Data Science, where he helps clients define and execute the strategies and data architectures that enable differentiated business growth. Building on 20 years of experience making emerging technologies relevant to enterprises, he has advised clients on the impact of technological change, typically working with CIOs, CTOs, and heads of business. Scott has helped clients drive global technology strategy, conduct prioritization of technology investments, shape alliance strategy based on technology, and build solutions for their businesses. Previously, Scott was director of the Data Insights R&D practice within the Accenture Technology Labs, where he led a team focused on employing emerging technologies to discover the insight contained in data and effectively bring that insight to bear on business processes, enabling new and better outcomes and even entirely new business models. He also led the creation of Technology Vision, Accenture’s annual analysis of emerging technology trends impacting the future of IT, where he was responsible for tracking emerging technologies, analyzing their transformational potential, and using them to influence technology strategy for both Accenture and its clients.

Presentations

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Scott Kurth, and Stephen O'Sullivan field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for (and the evolving role of) the CDO. Even if you don’t have a specific question, join in to hear what others are asking.

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and John Akred explain how to create a modern data strategy that powers data-driven business.

Philip Langdale is the engineering lead for cloud at Cloudera. He joined the company as one of the first engineers building Cloudera Manager and served as an engineering lead for that project until moving to cloud products. Previously, Philip worked at VMware, developing various desktop virtualization technologies. Philip holds a bachelor’s degree with honors in electrical engineering from the University of Texas at Austin.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Eugene Fratkin, Philip Langdale, David Tishgart, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

How to optimally run Cloudera batch data engineering workflows in AWS Session

Cloudera Enterprise has made many focused optimizations in order to leverage all of the cloud-native capabilities of AWS for the CDH platform. Andrei Savu and Philip Langdale take you through all the ins and outs of successfully running end-to-end batch data engineering workflows in AWS and demonstrate a Cloudera on AWS data engineering workflow with a sample use case.

Alon Lebenthal is a senior manager in workload automation solutions marketing at BMC. Alon has over 24 years of experience in the IT industry. Previously, he held various leadership positions in brand management, channels, and solutions marketing. Alon is a regular speaker at big data conferences and BMC events around the world.

Presentations

Ingest, process, analyze: Automation and integration through the big data journey Session

Neil Cullum and Alon Lebenthal demonstrate how BMC can help automate every aspect of the big data journey with Control-M’s enterprise-grade automation capabilities and job-as-code approach, helping you deliver big data projects faster and better.

Damien Lefortier is a machine-learning engineer on the Ads Ranking team at Facebook. Previously, Damien worked on the core Machine Learning team at Criteo, where he helped improve Criteo’s predictive algorithms for ad targeting, and on the Search team at Yandex, where he focused on search quality and infrastructure. He is working toward a PhD in information retrieval at the University of Amsterdam. His research work has been published at top-tier conferences, such as WWW and CIKM.

Presentations

Machine learning with partial and biased feedback Session

There are use cases where the only accessible feedback for training machine-learning models is partial and biased (e.g., when feedback is obtained through surveys). Damien Lefortier shares methods to handle these cases and explains how to ensure that they are performing well.
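One standard family of corrections for this kind of bias (an illustrative sketch, not necessarily the methods Damien presents) is inverse propensity scoring: weight each observed example by the inverse of the probability that its feedback was observed, so examples that are rarely surveyed count for more.

```python
import numpy as np

def ips_estimate(rewards, observed, propensities):
    """Inverse propensity scoring (IPS) estimate of the mean reward.

    rewards      -- reward for each example (only meaningful where observed)
    observed     -- boolean mask: did we receive feedback for this example?
    propensities -- probability that feedback was observed for each example
    """
    weights = observed / propensities          # 1/p where observed, 0 otherwise
    return np.sum(rewards * weights) / len(rewards)

# Toy example: feedback is twice as likely for positive outcomes, so a naive
# average over observed examples is biased upward; IPS corrects for it.
rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=100_000).astype(float)   # true mean ~0.5
propensities = np.where(rewards == 1, 0.8, 0.4)
observed = rng.random(100_000) < propensities

naive = rewards[observed].mean()                     # biased upward (~0.67)
ips = ips_estimate(rewards, observed, propensities)  # close to 0.5
```

In practice the propensities themselves must be logged or modeled, which is where most of the difficulty lies.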

Xueyan Li is a data platform R&D engineer at Qunar, where he is mainly responsible for the continuous integration and development of the Mesos resource management system and the Alluxio distributed memory management system, as well as providing data platform support as a shared service for all business lines. His other focuses include the ELK log ETL platform, Spark, Storm, Flink, and Zeppelin. He holds a degree in software engineering from Heilongjiang University.

Presentations

Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x performance improvement to Qunar’s stream processing Session

Alluxio—the first memory-speed virtual distributed storage system in the world—unifies the data from various under storage systems and presents a global namespace to various computation frameworks. Xueyan Li and Yupeng Fu explore how Alluxio has delivered performance improvements averaging 300x at peak service time on stream processing workloads at Qunar.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Hardcore Data Science welcome Tutorial

Ben Lorica and Angie Ma welcome you to the all-day Hardcore Data Science tutorial.

Nir Lotan is a machine-learning product manager and team manager in Intel’s Advanced Analytics department. Nir’s team develops machine-learning and deep learning-related tools, including a tool that enables easy creation of deep learning models. Prior to this role, Nir held several product, system, and software management positions within Intel’s Design Center organization and other leading companies. Nir has 15 years of experience in software and systems engineering, products, and management. He holds a BSc degree in computer engineering from the Technion Institute of Technology.

Presentations

Faster deep learning solutions from training to inference Session

Barak Rozenwax and Nir Lotan explain how to easily train and deploy deep learning models for image and text analysis problems using Intel's Deep Learning SDK, which enables you to use deep learning frameworks that were optimized to run fast on regular CPUs, including Caffe and TensorFlow.

Eric Lotter is vice president of technical services at WANdisco, where he oversees the delivery of all technical solutions, including product configurations. Eric has over 20 years’ experience building analytics, integration, and big data systems. Previously, Eric was the director of sales engineering at Virtual Bridges, where he ran the Global SE department.

Presentations

Replication as a service Session

Eric Lotter offers an overview of WANdisco's strongly consistent replication service for replicating between cloud object stores, HDFS, NFS, and other S3- and Hadoop-compatible filesystems.

Alison Lowndes is a solution architect and community manager at NVIDIA. Alison has 25+ years in international project management and entrepreneurship, with two decades spent within the internet arena. In her spare time, she is a founder trustee of a global volunteering network. A recent artificial intelligence graduate of the University of Leeds, where she completed a thorough empirical study of deep learning on GPU technology, covering the history and technical aspects of GPGPU and its underlying mathematics, Alison combines technical and theoretical computer science with a physics background.

Presentations

Deep learning for object detection and neural network deployment Tutorial

Alison Lowndes leads a hands-on exploration of approaches to the challenging problem of detecting if an object of interest is present within an image and, if so, recognizing its precise location within the image. Along the way, Alison walks you through testing three different approaches to deploying a trained DNN for inference.

Angie Ma is cofounder and COO of ASI Data Science, a London-based AI tech startup that offers data science as a service, which has completed more than 120 commercial data science projects in multiple industries and sectors and is regarded as the EMEA-based leader in data science. Angie is passionate about real-world applications of machine learning that generate business value for companies and organizations and has experience delivering complex projects from prototyping to implementation. A physicist by training, Angie was previously a researcher in nanotechnology working on developing optical detection for medical diagnostics.

Presentations

Ask me anything: Data science applications and deployment AMA

Angie Ma and Scott Stevenson share their experience and lessons learned from having worked on over 160 commercial data science projects with over 120 organizations from different sectors and industries.

Hardcore Data Science welcome Tutorial

Ben Lorica and Angie Ma welcome you to the all-day Hardcore Data Science tutorial.

Mark Madsen is a research analyst at Third Nature, where he advises companies on data strategy and technology planning. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide. He focuses on two types of work: the business applications of data and guiding the construction of data infrastructure. As a result, Mark does as much information strategy and IT architecture work as he does performance management and analytics.

Presentations

Executive Briefing: Dealing with device data Session

In 2007, a computer game company decided to jump ahead of competitors by capturing and using data created during online gaming, but it wasn't prepared to deal with the data management and process challenges stemming from distributed devices creating data. Mark Madsen shares a case study that explores the oversights, failures, and lessons the company learned along its journey.

Organizing the data lake Session

Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture.

Roger Magoulas is the research director at O’Reilly Media and chair of the Strata + Hadoop World conferences. Roger and his team build the analysis infrastructure and provide analytic services and insights on technology-adoption trends to business decision makers at O’Reilly and beyond. He and his team find what excites key innovators and use those insights to gather and analyze faint signals from various sources to make sense of what others may adopt and why.​

Presentations

Thursday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Roland Major is an enterprise architect at Transport for London, where he works on the Surface Intelligent Transport System, which aims to improve the operation of the road network across London and provide greater insight from existing and new data sources using modern data analytics techniques. Previously, Roland worked on event-driven architectures and solutions in the nuclear, petrochemical, and transport industries.

Presentations

Transport for London: Using data to keep London moving Tutorial

Transport for London (TfL) and WSO2 have been working together on broader integration projects focused on getting the most efficient use out of London transport. Roland Major and Sriskandarajah Suhothayan explain how TfL and WSO2 bring together a wide range of data from multiple disconnected systems to understand current and predicted transport network status.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover, Ted Malaska, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

What no one tells you about writing a streaming app Session

Any nontrivial streaming app requires that you consider a number of important topics, but questions like how to manage offsets or state often go unanswered. Mark Grover and Ted Malaska share practices that no one talks about when you start writing a streaming app but that you'll inevitably need to learn along the way.
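One of those inevitable lessons is offset management. A minimal sketch of a common pattern (with a hypothetical in-memory store standing in for a transactional sink): commit each batch's results together with its ending offset, so a restart resumes from the stored offset without losing or double-counting records.

```python
# Hypothetical in-memory "store"; in a real system these two writes would be
# one transaction against the sink (e.g., a database), not two separate steps.
store = {"results": [], "offset": 0}

def process_batch(records, start_offset):
    results = [r.upper() for r in records]   # any per-record transformation
    # Commit results and the new offset together in one atomic step.
    store["results"].extend(results)
    store["offset"] = start_offset + len(records)

stream = ["a", "b", "c", "d", "e"]           # stands in for a Kafka partition
while store["offset"] < len(stream):
    off = store["offset"]                    # always resume from stored offset
    process_batch(stream[off:off + 2], off)
```

Because the offset advances only when results are durably written, replaying a batch after a crash reproduces the same state rather than corrupting it.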

Nikolay Manchev is a data scientist on IBM’s Big Data technical team. He specializes in machine learning, data science, and big data. He is a speaker, blogger, and the organizer of the London Machine Learning Study Group meetup. Nikolay holds an MSc in software technologies and an MSc in data science, both from City University London.

Presentations

Multinode restricted Boltzmann machines for big data Session

Nikolay Manchev offers an overview of the restricted Boltzmann machine, a type of neural network with a wide range of applications, and shares his experience using it on Hadoop (MapReduce and Spark) to process unstructured and semistructured data at scale.
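For context, the heart of RBM training is the contrastive divergence (CD-1) update. A minimal single-machine sketch (bias terms and the distributed MapReduce/Spark machinery the session covers are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, rng, lr=0.1):
    """One contrastive-divergence (CD-1) weight update for a binary RBM.

    v0 -- batch of visible vectors, shape (batch, n_visible)
    W  -- weight matrix, shape (n_visible, n_hidden)
    """
    h0_prob = sigmoid(v0 @ W)                        # hidden given data
    h0 = (rng.random(h0_prob.shape) < h0_prob) * 1.0  # sample hidden units
    v1_prob = sigmoid(h0 @ W.T)                      # reconstruct visibles
    h1_prob = sigmoid(v1_prob @ W)                   # hidden given reconstruction
    # Positive phase minus negative phase, averaged over the batch.
    grad = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    return W + lr * grad

rng = np.random.default_rng(42)
W = rng.normal(0, 0.01, size=(6, 3))                 # 6 visible, 3 hidden units
data = (rng.random((16, 6)) < 0.5) * 1.0             # toy binary training batch
for _ in range(10):
    W = cd1_step(data, W, rng)
```

Distributing this amounts to computing the positive- and negative-phase statistics over data partitions and summing them, which is why the algorithm maps naturally onto MapReduce and Spark.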

Eliano Marques is the head of data science at Teradata International. Eliano has successfully led teams and projects to develop and implement analytics platforms, predictive models, and analytics operating models and has supported many businesses in making better decisions through the use of data. Recently, Eliano has focused on developing analytics solutions for customers around deep learning, AI, predictive asset maintenance, customer path analytics, and customer experience analytics across different industries. Eliano holds a degree in economics, an MSc in applied econometrics and forecasting, and several certifications in machine learning and data mining.

Presentations

Driving business value: Predicting piston ring failures in massive vessels Session

Eliano Marques and Mads Ingwar share a case study on how to leverage data science to plan ship engine maintenance by warning about potential piston ring failure.

David Martinez Rego is a data scientist at DataSpartan specializing in designing tailored software systems and algorithms. He has been part of laboratories in a number of academic institutions, including the University of A Coruña, the University of Florida, and University College London. Currently, he is based in London, where he divides his time doing research, lecturing, and consulting for different industries and startups.

Presentations

Principles of data science management Session

The growth of data science as a strategic discipline makes its correct management paramount to the survival of new and traditional businesses that want to compete in a foreseeable data-driven economy. David Martinez Rego shares a set of sound, solid principles that will help increase your effectiveness as a data science manager.

As the CEO and cofounder of Silicon Valley Data Science, Sanjay Mathur has brought together a team of world-class data scientists and engineers to help companies become more data driven. Previously, Sanjay was a partner in Accenture Technology Labs, Accenture’s R&D organization, where he led a global team that delivered market-ready business solutions built on emerging technologies to Accenture’s clients. There he built three different analytics and data practices: the Information Insight R&D team, focused on machine learning and the semantic web; the Analytics and Insight group, focused on predictive analytics; and the Data and Platforms R&D group, focused on analytics, big data, and virtualization. He was also SVP of product for LiveOps, where he was responsible for LiveOps’s overall product strategy and roadmap and designed and deployed social, mobile, multichannel, and analytic applications into the LiveOps Cloud Platform.

Presentations

The business case for deep learning, Spark, and friends Data 101

Deep learning is white-hot at the moment, but why does it matter? Developers are usually the first to understand why some technologies cause more excitement than others. Sanjay Mathur relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Leah McGuire is a lead member of the technical staff at Salesforce Einstein, where she builds platforms to enable the integration of machine learning into Salesforce products. Previously, Leah was a senior data scientist on the data products team at LinkedIn working on personalization, entity resolution, and relevance for a variety of LinkedIn data products and completed a postdoctoral fellowship at the University of California, Berkeley. She holds a PhD in computational neuroscience from the University of California, San Francisco, where she studied the neural encoding and integration of sensory signals.

Presentations

Meta-data science: When all the world's data scientists are just not enough Session

What if you had to build more models than there are data scientists in the world—a feat enterprise companies serving hundreds of thousands of businesses often have to do? Leah McGuire offers an overview of Salesforce's general-purpose machine-learning platform that automatically builds per-company optimized models for any given predictive problem at scale, beating out most hand-tuned models.

Aida Mehonic is an engagement manager at ASI Data Science with a focus on financial services. Previously, she worked in investment banking for four years, most recently as a front office strategist at JPMorgan Investment Bank developing quantitative models and publishing investment research. Aida is a bronze medallist at the International Physics Olympiad. She holds a BA and MMath in mathematics from Cambridge University and a PhD in theoretical physics from UCL. Her research has been published in Nature.

Presentations

Deep learning in commodities markets Tutorial

Aida Mehonic explains how ASI Data Science has trained a deep neural net on historical prices of liquid financial contracts. The neural net has already outperformed comparable strategies based on expert systems.

Is finance ready for AI? FinData

Quantitative finance has been a key feature in the financial industry for over 30 years. Big data, machine learning, and AI are increasingly being used today, but is the financial industry actually ready for AI? Aida Mehonic explores some of the most common applications of AI in finance and shares the typical challenges of data transformation and AI adoption that financial institutions face.

Is finance ready for AI? Keynote

Aida Mehonic, Engagement Manager in Data Science, ASI Data Science

Ian Meyers is a principal solution architect at AWS, where he manages the Specialist Solution Architecture team in EMEA. He has a background in data management systems—specifically data warehousing, big data, and event-driven architectures—in the financial services, communications, and gaming industries.

Presentations

Building your first big data application on AWS Tutorial

Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ian Meyers, Pratim Das, and Ian Robinson walk you through building a big data application in real time using a combination of open source technologies, including Apache Hadoop, Spark, and Zeppelin, as well as AWS managed services such as Amazon EMR, Amazon Kinesis, and more.

John Mikula is a tech lead for Google Cloud, where he manages the team focused on enterprise features for Google Cloud Dataproc.

Presentations

Architecting and building enterprise-class Spark and Hadoop in cloud environments Tutorial

John Mikula explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.

Cory Minton is a principal systems engineer at EMC, where he works hand in hand with clients across the globe to assess and develop big data strategies, architect technology solutions, and ensure successful deployments of these transformational initiatives. A geek, technology evangelist, and business strategist, Cory is focused on finding creative ways for organizations to drive the utmost value from their data while transforming IT’s relevance to the organizations and customers they serve. With a diverse background in IT applications, consulting, data center infrastructure, and the expanding Data Fabric ecosystem, Cory brings an interesting perspective to the clients he serves while consistently challenging them to think bigger. Cory holds an undergraduate degree in engineering from Texas A&M University and an MBA from Tennessee Tech University. Cory resides in Birmingham, Alabama, with his beautiful wife and two awesome children.

Presentations

Architecture best practices for big data deployments Session

Cory Minton pulls back the covers on how big data applications impact underlying hardware based on real-world deployments and shares Dell EMC’s internal testing and benchmarking used to develop its architecture best practices. Along the way, Cory shows you how to get your architecture right the first time for optimal performance and scaling.

Mostafa Mokhtar is a performance engineer at Cloudera. Previously, he held similar roles at Hortonworks and on the SQL Server team at Microsoft.

Presentations

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL on Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Sherry Moore is a software engineer on the Google Brain team. Her other projects at Google include Google Fiber and Google Ads Extractor. Previously, she spent 14 years as a systems and kernel engineer at Sun Microsystems.

Presentations

The state of TensorFlow and where it is going in 2017 Session

Sherry Moore discusses TensorFlow progress and adoption over 2016 and looks ahead to TensorFlow efforts in future areas of importance, such as performance, usability, and ubiquity.

Jonathon Morgan is the CEO of New Knowledge, a startup using AI for digital messaging and intelligence. As part of his ongoing work combating violent extremism, Jonathon served as an advisor to the White House and State Department, coauthored the ISIS Twitter Census for the Brookings Institution, and develops new technology with DARPA. Jonathon is also a cohost of the surprisingly popular Partially Derivative podcast and a founding member of Data for Democracy, a volunteer platform for data science social impact projects.

Presentations

Fighting bad guys with data science Session

Jonathon Morgan explores computer vision, deep learning, and natural language processing techniques for uncovering communities of white nationalists and neo-Nazis on social media and identifying which ones are on the path to radicalization.

Alan Mosca is senior data engineer at Sendence and a part-time doctoral researcher at Birkbeck, University of London, where his research focuses on deep learning ensembles and improvements to optimization algorithms in deep learning. Previously, Alan worked at Wadhwani Asset Management, Jane Street Capital, and several software companies as well as on several consulting projects in machine learning and deep learning.

Presentations

Ensembles in deep learning with Toupee Tutorial

Alan Mosca discusses using ensembles in deep learning and tackles a benchmark problem in computer vision with Toupee, a library and toolkit for experimentation with deep learning and ensembles.

Barzan Mozafari is an assistant professor of computer science and engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Previously, Barzan was a postdoctoral associate at MIT. His research career has led to many successful open source projects, including CliffGuard (the first robust framework for database tuning), DBSeer (the first automated database diagnosis tool), and BlinkDB (the first massively parallel approximate query engine). Barzan has won the National Science Foundation CAREER award as well as several best paper awards in ACM SIGMOD and EuroSys. He is also a cofounder of DBSeer and a strategic advisor to SnappyData, a company that commercializes the ideas introduced by BlinkDB. Barzan holds a PhD in computer science from UCLA.

Presentations

Verdict: Platform-independent analytics and visualization at subsecond latencies Session

Visualization and exploratory analytics require subsecond interactions with massive volumes of data, a goal that has remained elusive due to numerous inefficiencies across the stack. Barzan Mozafari offers an overview of Verdict, an open source middleware that guarantees subsecond visualization and analytics and works with Impala, Spark, Hive, and most other engines in the Hadoop ecosystem.
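The core idea behind approximate query engines of this kind can be sketched in a few lines (illustrative only, not Verdict's implementation): answer an aggregate from a small uniform sample and report an error bound, rather than scanning the full table.

```python
import random
import statistics

def approx_avg(table, sample_size, z=1.96):
    """Estimate the average of a column from a uniform sample.

    Returns (estimate, margin), where margin is the 95% confidence half-width.
    """
    sample = random.sample(table, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / (sample_size ** 0.5)
    return mean, z * stderr

random.seed(7)
table = [random.gauss(100, 15) for _ in range(1_000_000)]  # the full "table"
estimate, margin = approx_avg(table, 10_000)               # scan only 1% of it
```

Scanning 1% of the rows here yields an answer within a fraction of a percent of the truth; production systems add stratified sampling and rewrite SQL transparently, but the accuracy-for-latency trade is the same.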

Calum Murray is the chief data architect in the Small Business group at Intuit. Calum has 20 years’ experience in software development, primarily in the finance and small business spaces. Over his career, he has worked with various languages, technologies, and topologies to deliver everything from real-time payments platforms to business intelligence platforms.

Presentations

Journey to AWS: Straddling two worlds Session

As Intuit moves its SaaS platform from its own data centers to AWS, it will straddle both worlds for a period of time (and potentially indefinitely). Calum Murray looks at what straddling means to data and data systems.

Jacques Nadeau is the cofounder and CTO of Dremio. He is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Previously, Jacques was the architect and engineering manager for Drill and other distributed systems technologies at MapR and the CTO and cofounder of YapMap, an enterprise search startup. He also held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).

Presentations

Creating a virtual data lake with Apache Arrow Session

In most organizations, data is spread across multiple data sources, such as Hadoop/cloud storage, RDBMS, and NoSQL. Tomer Shiran and Jacques Nadeau offer an overview of Apache Arrow, an open source in-memory columnar technology that enables users to combine multiple data sources and expose them as a virtual data lake to users of Spark, SQL-on-Hadoop, Python, and R.

Paco Nathan leads the Learning group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

AI within O'Reilly Media Session

Paco Nathan explains how O'Reilly employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video.

Computable content: Notebooks, containers, and data-centric organizational learning Session

O'Reilly recently launched Oriole, a new learning medium for online tutorials that combines Jupyter notebooks, video timelines, and Docker containers run on a Mesos cluster, based on the pedagogical theory of computable content. Paco Nathan explores the system architecture, shares project experiences, and considers the impact of notebooks for sharing and learning across a data-centric organization.

Allison Nau is head of data solutions at Cox Automotive UK. Allison is a highly driven and self-motivated big data, analytics, and product executive with a proven track record in transforming businesses and driving strategic growth through data analysis and product development. Previously, Allison worked at LexisNexis, where she developed the entire product portfolio of data and analytics products for its expansion into the UK, leading to double-digit growth year on year for that new venture while transforming the motor insurance industry. A trained quantitative political scientist who got her start as a price optimization consultant, Allison holds a BA in mathematics and international relations from the College of Wooster and an MA in political science from the University of Michigan.

Presentations

Big data at Cox Automotive: Delivering actionable insights to transform the way the world buys, sells, and owns vehicles Tutorial

Twenty months into its big data journey, Cox Automotive is using a variety of tools and techniques to deliver actionable insights, transforming decision making within the automotive industry. Allison Nau shares lessons learned, including where to begin transforming a legacy business and industry to become more data driven, how to gain momentum, and how to deliver meaningful results at pace.

Arshak Navruzyan is a machine-learning-focused product manager and the founder of Startup.ML, a machine-learning fellowship program that has graduated over 30 data scientists now employed by companies including Uber, Facebook, and Baidu. Arshak has delivered AI solutions for multibillion-dollar quantitative hedge funds, numerous venture-funded startups, and some of the largest telecoms in the world and has held technology leadership roles at Argyle Data, Alpine Data Labs, and Endeca/Oracle.

Presentations

Video anomaly detection with self-supervised deep nets Session

Deep learning affords novel and powerful techniques for video prediction and analysis. Arshak Navruzyan explores the current state of the art for video analysis using deep learning techniques and the associated challenges.

Matthias Niehoff is an IT consultant at codecentric AG in Germany, where he focuses on big data and streaming applications with Apache Cassandra and Apache Spark—as well as other tools in the area of big data. Matthias shares his experience at conferences, meetups, and user groups.

Presentations

Lessons learned working with Spark and Cassandra Session

Matthias Niehoff shares lessons learned working with Spark, Cassandra, and the Spark-Cassandra connector and best practices drawn from his work on multiple big and fast data projects, as well as challenges encountered along the way.

Kim Nilsson is the CEO of Pivigo, a London-based data science marketplace and training provider responsible for S2DS, Europe’s largest data science training program, which has by now trained more than 340 fellows working on over 85 commercial projects with 60+ partner companies, including Barclays, KPMG, Royal Mail, News UK, and Marks & Spencer. An ex-astronomer turned entrepreneur with a PhD in astrophysics and an MBA, Kim is passionate about people, data, and connecting the two.

Presentations

From data dinosaurs to data stars in five weeks: Lessons from completing 80 data science projects Session

More organizations are becoming aware of the value of data and want to get started and scaled up as quickly as possible. But how? Is it possible to get something useful done in five weeks? Kim Nilsson shares her experiences, both good and bad, delivering over 80 five-week data science projects to over 50 organizations, as well as some concrete tips on how to become a data star organization.

Meet the Expert with Kim Nilsson (Pivigo) Meet the Experts

Getting started with data science can be hard. Want some advice? Stop by and meet with Kim.

Michael Noll is a product manager at Confluent, the company founded by the creators of Apache Kafka. Previously, Michael was the technical lead of DNS operator Verisign’s big data platform, where he grew the Hadoop, Kafka, and Storm-based infrastructure from zero to petabyte-sized production clusters spanning multiple data centers—one of the largest big data infrastructures in Europe at the time. He is a well-known tech blogger in the big data community. In his spare time, Michael serves as a technical reviewer for publishers such as Manning and is a frequent speaker at international conferences, including ACM SIGIR, Web Science, and ApacheCon. Michael holds a PhD in computer science.

Presentations

Rethinking stream processing with Apache Kafka: Applications versus clusters and streams versus databases Session

Michael Noll explains how Apache Kafka helps you radically simplify your data processing architectures by building normal applications to serve your real-time processing needs rather than building clusters or similar special-purpose infrastructure—while still benefiting from properties typically associated exclusively with cluster technologies.

Michael Nolting is a data scientist for Volkswagen commercial vehicles. Michael has worked in a variety of research fields at Volkswagen AG, including adapting big data technologies and machine learning algorithms to the automotive context. Previously, he was head of a big data analytics team at Sevenval Technologies. Michael holds a Dipl.-Ing. degree in electrical engineering and an MSc degree in computer science, both from the Technical University of Brunswick in Germany, and a PhD in computer science.

Presentations

How to prevent future accidents in autonomous driving Session

Sampling enough training data up front to prevent autonomous driving accidents on the road is nearly impossible, as Tesla's Autopilot has sadly proven. Michael Nolting explains that overcoming this problem requires a system that detects dangerous situations in real time, much like website monitoring.

Jack Norris is the senior vice president of data and applications at MapR Technologies. Jack has a wide range of demonstrated successes, from defining new markets for small companies to increasing sales of new products for large public companies, in his 20 years spent in enterprise software marketing. Jack’s broad experience includes launching and establishing analytics, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain & Company. Jack has an MBA from UCLA’s Anderson School of Management and a BA in economics with honors and distinction from Stanford University.

Presentations

Identifying and exploiting the keys to digital transformation Session

Leading companies are integrating operations and analytics to make real-time adjustments to improve revenues, reduce costs, and mitigate risks. There are many aspects to digital transformation, but the timely delivery of actionable data is both a key enabler and an obstacle. Jack Norris explores how companies from Altitude Digital to Uber are transforming their businesses.

Tim O’Reilly has a history of convening conversations that reshape the computer industry. In 1998, he organized the meeting where the term “open source software” was agreed on and helped the business world understand its importance. In 2004, with the Web 2.0 Summit, he defined how “Web 2.0” represented not only the resurgence of the web after the dot-com bust but a new model for the computer industry, based on big data, collective intelligence, and the internet as a platform. In 2009, with his “Gov 2.0 Summit,” Tim framed the conversation about the modernization of government technology that has shaped policy and spawned initiatives at the federal, state, and local levels and around the world. He has now turned his attention to implications of the on-demand economy, AI, robotics, and other technologies that are transforming the nature of work and the future shape of the economy. Tim is the founder and CEO of O’Reilly Media and a partner at O’Reilly AlphaTech Ventures (OATV). He sits on the boards of Maker Media (which was spun out from O’Reilly Media in 2012), Code for America, PeerJ, Civis Analytics, and POPVOX.

Presentations

Using AI to create new jobs Keynote

The history of technology shows that while new technology has always destroyed jobs, it has also created new ones, in part because it makes things that were previously too expensive cheap enough to expand demand. Tim O'Reilly explains how AI will make currently unthinkable things possible. If we put it to work properly, it can lead to prosperity.

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and applications solutions. A veteran of @WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Scott Kurth, and Stephen O'Sullivan field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for (and the evolving role of) the CDO. Even if you don’t have a specific question, join in to hear what others are asking.

Martin Oberhuber is a managing partner and principal data scientist at Think Big, a Teradata Company, where he is responsible for Think Big Services in the Western EMEA region. Martin successfully builds and leads cross-functional teams of data scientists, developers, and researchers and helps clients adopt new techniques and processes that empower them to take full advantage of their data. Previously, Martin led the International Data Science practice at Think Big. Martin has a broad range of research, development, and management skills and experience in several sectors, including finance, credit risk, retail, manufacturing, and consumer goods. He is a passionate problem solver with extensive experience developing automated trading strategies and credit risk models using machine learning, quantitative techniques, and big data technologies. Martin holds an MSc in mechanical engineering from the Swiss Federal Institute of Technology and an MSc in computational finance from Carnegie Mellon University.

Presentations

Empowering data analytics: Real-life use cases Session

In today’s big data world, organizations are struggling to establish new capabilities, processes, and organization models to deliver advanced analytics solutions. Martin Oberhuber explores real-world use cases that illustrate the capabilities needed to develop, deploy, monitor, and maintain analytical processes to seamlessly go from insight to production.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Enabling data science in the enterprise Keynote

Mike Olson and Tom Smith explain how the Office for National Statistics (ONS), the UK's largest independent producer of official statistics, is leveraging data science to create repeatable, accurate, and transferable statistical research while reducing model development time and improving visibility and results.

Nelleke Oostdijk is an associate professor at Radboud University in Nijmegen, the Netherlands. A computational linguist with a keen interest in language use and variation, Nelleke has been involved in various projects directed at extracting information from social media data. More specifically, she has been exploring ways in which linguistic knowledge can be brought into play to improve on purely machine-learning approaches. In collaboration with different societal partners, she has helped demonstrate the strength of a hybrid approach when applied to a range of topics and domains, including detecting threatening tweets, identifying forum posts suggesting that specific food supplements might be contaminated, and topic and event detection in tweets about natural disasters (earthquakes, floods, etc.) and emergencies.

Presentations

Relevancer: Finding and labeling relevant information in tweet collections Session

Identifying relevant tweets in tweet collections that are gathered via keywords is a huge challenge. Ali Hürriyetoglu and Nelleke Oostdijk share the results of a study on using unsupervised and supervised machine learning with linguistic insight to enable people to identify relevant tweets for their needs and offer an overview of their tool, Relevancer.

Łukasz Osipiuk is a software engineer at the Teradata Center for Hadoop within Teradata Labs, where he is actively engaged in open source Presto development and architecture design. Łukasz was a core member of SQL-on-Hadoop startup Hadapt before its acquisition by Teradata in 2014. Previously, Łukasz was employed at GG Network, where he worked on its large-scale instant messenger core backend and distributed drive storage backend. He graduated from Warsaw University.

Presentations

Presto: Distributed SQL done faster Session

Wojciech Biela and Łukasz Osipiuk offer an introduction to Presto, an open source distributed analytical SQL engine that enables users to run interactive queries over their datasets stored in various data sources, and explore its applications in various big data problems.

Jerry Overton is a data scientist and distinguished technologist in DXC’s Analytics group, where he is the principal data scientist for industrial machine learning, a strategic alliance between DXC and Microsoft comprising enterprise-scale applications across six different industries: banking and capital markets, energy and technology, insurance, manufacturing, healthcare, and retail. Jerry is the author of Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist (O’Reilly) and teaches the Safari training course Mastering Data Science at Enterprise Scale. In his blog, Doing Data Science, Jerry shares his experiences leading open research and transforming organizations using data science.

Presentations

Executive Briefing: Advanced analytics in the cloud Session

This Executive Briefing is a part of the Strata Business Summit. Details to come.

Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is a coauthor of Advanced Analytics with Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.

Presentations

Meet the Expert with Sean Owen (Cloudera) Meet the Experts

Sean will talk about Apache Spark, data science on Apache Hadoop, building data science teams, and open source software.

What "50 Years of Data Science" leaves out Session

Nobody seems to agree on just what data science is. Is it engineering, statistics, or both? David Donoho's "50 Years of Data Science" offers a criticism of the hype around data science from a statistics perspective, arguing that it's not a new field. Sean Owen responds with counterpoints from an engineering perspective, in search of a better understanding of how to teach and practice data science in 2017.

Andy Petrella is the CEO of Kensu, where he also gets his hands dirty in Adalog’s code. Andy is a mathematician turned distributed computing entrepreneur. Besides being a Scala/Spark trainer, Andy has participated in many projects built using Spark, Cassandra, and other distributed technologies in various fields, including geospatial analysis, the IoT, and automotive and smart cities projects. Andy is the creator of the Spark Notebook, the only reactive and fully Scala notebook for Apache Spark. In 2015, Andy cofounded Data Fellas with Xavier Tordoir around their product, the Agile Data Science Toolkit, which facilitates the productization of data science projects and guarantees their maintainability and sustainability over time. Andy is also a member of the program committee for the O’Reilly Strata, Scala eXchange, Data Science eXchange, and Devoxx events.

Presentations

Data science governance: What and how Session

Data science for enterprise use cases explodes the number of intermediate datasets, so one of the coming challenges is finding a way through these ever-growing data sources. Andy Petrella proposes a data-science-on-data-science approach, using behavioral data combined with static and runtime metadata of processes.

Nicolas Poggi is an IT professional and researcher with a focus on the performance and scalability of data-intensive applications. Nicolas leads a new research project on upcoming architectures for the web at the Barcelona Supercomputing Center and Microsoft Research joint center in Barcelona. He combines a pragmatic approach to performance and scalability from his web industry experience with research in server resource management (such as leveraging machine-learning techniques to optimize performance and profits on the web). Nicolas founded the Barcelona Web Performance group, where he is a frequent speaker and organizer, and is organizing the upcoming WebPerfDays event in Barcelona. He also lectures in master’s classes at UPC. Nicolas holds a PhD from BarcelonaTech (UPC).

Presentations

The state of Spark in the cloud Session

Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance of major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as the baseline.

Aurélie Pols designs data privacy best practices, documenting data flows in order to limit privacy backlashes and minimizing risk related to ever-increasing data uses while solving for data quality—the most accurate label would probably be “privacy engineer.” She is used to following the money to optimize data trails; now she follows the data to minimize increasing compliance and privacy risks and implement security best practices and ethical data use. Her mantra is: Data is the new oil; Privacy is the new green; Trust is the new currency. Aurélie is the chief visionary officer of Mind Your Group by Mind Your Privacy. She has spent the past 15 years optimizing (digital) data-based decision-making processes. She also cofounded and successfully sold a startup to Digitas LBi (Publicis). Aurélie has spoken at various events all over the globe, including SXSW, Strata + Hadoop World, the IAPP’s Data Protection Congress, Webit, and eMetrics summits, and has written several white papers on data privacy and privacy engineering best practices. Aurélie is a member of the European Data Protection Supervisor’s (EDPS) Ethics Advisory Group (EAG), cochairs the IEEE’s P7002—Data Privacy Process standard initiative, and serves as a training advisor to the International Association of Privacy Professionals (IAPP). Previously, she served as data governance and privacy advocate for leading data management platform (DMP) Krux Digital Inc. prior to its acquisition by Salesforce. She teaches privacy and ethics at IE Business School in Madrid and Solvay Business School in Brussels.

Presentations

Executive Briefing: Data governance and evolving privacy legislation—Daring to move beyond compliance Session

The EU's General Data Protection Regulation is an ambitious legal project to reinstate the rights of "data subjects" within an increasingly lucrative data ecosystem. Aurélie Pols explores the legal obligations on companies and their respective interpretations and looks at how scale and integrity will be safeguarded in the data we increasingly base decisions upon in the long term.

The data subject first? Keynote

You may have heard that a piece of legislation called the GDPR is looming. Aurélie Pols draws a broad philosophical picture of the data ecosystem we are all a part of before honing in on the right to data portability, hopefully empowering you to reclaim your data subject rights.

Harry Powell is director and head of advanced data analytics at Barclays.

Presentations

Making recommendations using graphs and Spark Session

Harry Powell and Raffael Strassnig demonstrate how to model unobserved customer preferences over businesses by thinking about transactional data as a bipartite graph and then computing a new similarity metric—the expected degrees of separation—to characterize the full graph.

Emma Prest is the general manager of DataKind UK, where she handles the day-to-day operations supporting the influx of volunteers and building understanding about what data science can do in the charitable sector. Emma sits on the Editorial Advisory Committee at the Bureau of Investigative Journalism. She was previously a program coordinator at Tactical Tech, providing hands-on help for activists using data in evidence-based campaigns. Emma holds an MA in public policy with a specialization in media, information, and communications from Central European University in Hungary and a degree in politics and geography from the University of Edinburgh, Scotland.

Presentations

How do you help charities do data? Session

Since its creation, DataKind has helped charities do some fantastic things with data science through volunteers from the data science community (that's you!). But charities often don't know what to do next. Duncan Ross and Emma Prest share lessons learned from DataKind's projects and outline a data maturity model for doing good with data.

Iñaki Puigdollers is a data scientist at Social Point, where he leads the analytics function for the company’s biggest game, Dragon City. Previously, Iñaki worked in data insights at Schibsted Media Group. He holds an MSc in statistical modeling.

Presentations

"Smartifying" the game Session

Low cost, big impact: this is what data science can bring to your business. Iñaki Puigdollers explores how the analytics department changed Social Point games, creating an even better gaming experience and business.

Giovanni Quattrone is a lecturer in computing science in the applied software engineering research group in the Department of Computer Science at Middlesex University’s School of Science and Technology, as well as an honorary member of the Department of Computer Science at University College London (UCL). Previously, Giovanni was a research fellow in the Geospatial Analytics and Computing group and in the Department of Computer Science at UCL, which he joined thanks to the FP7-PEOPLE-2009-IEF Marie Curie Action.

Presentations

Algorithmic regulation Session

Sharing economy platforms are poorly regulated because there is no evidence upon which to draft policies. Daniele Quercia and Giovanni Quattrone propose a means for gathering evidence by matching web data with official socioeconomic data and use data analysis to envision regulations that are responsive to real-time demands, contributing to the emerging idea of algorithmic regulation.

Daniele Quercia is currently building the Social Dynamics group at Bell Labs in Cambridge, UK. Daniele’s research focuses on urban informatics and has received best paper awards from Ubicomp 2014 and ICWSM 2015 as well as an honorable mention from ICWSM 2013. Previously, he was a research scientist at Yahoo Labs, a Horizon senior researcher at the University of Cambridge, and a postdoctoral associate at MIT. Daniele has been named one of Fortune magazine’s 2014 data all-stars and has spoken about “happy maps” at TED. He holds a PhD from University College London. His thesis was sponsored by Microsoft Research and was nominated for the BCS best British PhD dissertation in computer science.

Presentations

Algorithmic regulation Session

Sharing economy platforms are poorly regulated because there is no evidence upon which to draft policies. Daniele Quercia and Giovanni Quattrone propose a means for gathering evidence by matching web data with official socioeconomic data and use data analysis to envision regulations that are responsive to real-time demands, contributing to the emerging idea of algorithmic regulation.

Phillip Radley is chief data architect on BT’s core Enterprise Architecture team, where he is responsible for data architecture across BT Group Plc. Based at BT’s Adastral Park campus in the UK, Phill currently leads BT’s MDM and big data initiatives, driving associated strategic architecture and investment roadmaps for the business. Phill has worked in IT and the communications industry for 30 years, mostly with British Telecommunications Plc., and his previous roles in BT include nine years as chief architect for infrastructure performance-management solutions from UK consumer broadband to outsourced Fortune 500 networks and high-performance trading networks. He has broad global experience, including with BT’s Concert global venture in the US and five years as an Asia Pacific BSS/OSS architect based in Sydney. Phill is a physics graduate with an MBA.

Presentations

Hadoop as a service: How to build and operate an enterprise data lake supporting operational and streaming analytics Session

If you have Hadoop clusters in research or an early-stage data lake and are considering strategic vision and goals, this session is for you. Phillip Radley explains how to run Hadoop as a shared service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.

Meet the Expert with Phillip Radley (BT) Meet the Experts

BT has adopted Hadoop as an enterprise platform for data processing and storage. Come talk to Phillip to find out how they did it—and what you can learn from their experiences.

Syed Rafice is a senior system engineer at Cloudera, where he specializes in big data on Hadoop technologies and is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed also focuses on both platform and cybersecurity. He has worked across multiple sectors, including government, telecoms, media, utilities, financial services, and transport.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Radhika Rangarajan is an engineering director for big data technologies within Intel’s Software and Services group, where she manages several open source projects and partner engagements, specifically on Apache Spark and machine learning. Radhika is one of the cofounders and the director of the West Coast chapter of Women in Big Data, a grassroots community focused on strengthening the diversity in big data and analytics. Radhika holds both a bachelor’s and a master’s degree in computer science and engineering.

Presentations

Building deep learning-powered big data Session

Radhika Rangarajan explains how Intel works with its users to build deep learning-powered big data analytics applications (object detection, image recognition, NLP, etc.) using BigDL.

Pranav Rastogi is a program manager on the Azure HDInsight team, where he spends most of his time making it easier for customers to leverage the big data ecosystem to build big data solutions faster.

Presentations

Build big data enterprise solutions faster on Azure HDInsight Session

Pranav Rastogi explains how to simplify your big data solutions with Datameer, AtScale, Dataiku, and StreamSets on Microsoft’s Azure HDInsight, a cloud Spark and Hadoop service for the enterprise. Join in to learn practical information that will enable faster time to insights for you and your business.

Miriam Redi is a research scientist on the Social Dynamics team at Bell Labs Cambridge, where her research focuses on content-based social multimedia understanding and culture analytics. In particular, Miriam explores ways to automatically assess visual aesthetics, sentiment, and creativity and exploit the power of computer vision in the context of web, social media, and online communities. Previously, she was a postdoc in the Social Media group at Yahoo Labs Barcelona and a research scientist at Yahoo London. Miriam holds a PhD from the Multimedia group in EURECOM, Sophia Antipolis.

Presentations

The science of visual interactions Keynote

Miriam Redi explores the invisible side of visual data, investigating how machine learning can detect subjective properties of images and videos, such as beauty, creativity, sentiment, style, and more curious characteristics. Miriam shows how these detectors can be applied in the context of web media search, advertising, and social media.

Tom Reilly is the CEO of Cloudera. Tom has had a distinguished 30-year career in the enterprise software market. Previously, Tom was vice president and general manager of enterprise security at HP; CEO of enterprise security company ArcSight, where he led the company through a successful initial public offering and subsequent sale to HP; and vice president of business information services for IBM, following the acquisition of Trigo Technologies Inc., a master data management (MDM) software company, where he served as CEO. He currently serves on the boards of Jive Software, privately held Ombud Inc., ThreatStream Inc., and Cloudera. Tom holds a BS in mechanical engineering from the University of California, Berkeley.

Presentations

Possibilities powered by the cloud Keynote

The cloud is disrupting every segment of our industry. If you don’t have a strategy for it, then you’re missing what might be your best new market opportunity. Tom Reilly and Charles Zedlewski talk about how machine learning and real-time data in the cloud are powering a new wave of possibilities.

Doron Reuter is head of business development for the ING Wholesale Banking Advanced Analytics team, where he helps create customer value with big data, advanced analytics, and artificial intelligence by initiating projects and partnerships to ensure the development of data-driven algorithmic products for ING’s employees and clients. Doron has been a corporate and investment banker for 14 years at BNP Paribas in London, Fortis, and ING in the Netherlands. Previously, Doron worked at an internet startup in the ’90s and at Rational Software (now IBM). Doron holds a bachelor of science in computer science and economics and an MBA. Born in South Africa, he now lives in the Netherlands with his wife, Daniela, and his children, Mia and Etan. Doron is treasurer of a charity, and when family and work allow, he swims, ice skates, works out at the gym, runs, and does pretty much anything else sporty that anyone feels like doing with him.

Presentations

Three years into creating value at ING Wholesale Banking with big data, advanced analytics, and artificial intelligence FinData

Join Doron Reuter to learn how ING is creating value with big data, advanced analytics, and artificial intelligence.

Alberto Rey is head of data science at easyJet, where he leads easyJet’s efforts to adopt advanced analytics within different areas of the business. Alberto’s background is in air transport and economics, and he has more than 15 years’ experience in the air travel industry. He started his career in advanced analytics as a member of the Pricing and Revenue Management team at easyJet, working on the development of one of the most advanced pricing engines in the industry, where his team pioneered the implementation of machine-learning techniques to drive pricing. He holds an MSc in data mining and an MBA from Cranfield University.

Presentations

Growing a data-driven organization at easyJet Tutorial

Many large organizations want to develop data science capabilities, but the traditional complexity and legacy of such companies don’t allow a fast and agile evolution toward data-driven decision making. EasyJet is working toward becoming completely data driven. Alberto Rey shares real-world examples on how easyJet is tackling the challenges of scaling up its analytics capabilities.

Stephane Rion is a senior data scientist at Big Data Partnership, where he helps clients get insight into their data by developing scalable analytical solutions in industries such as finance, gaming, and social services. Stephane has a strong background in machine learning and statistics, with over 6 years’ experience in data science and 10 years’ experience in mathematical modeling. He has solid hands-on skills in machine learning at scale with distributed systems like Apache Spark, which he has used to develop production-grade applications. In addition to Scala with Spark, Stephane is fluent in R and Python, which he uses daily to explore data, run statistical analysis, and build statistical models. He was the first Databricks-certified Spark instructor in EMEA. Stephane enjoys splitting his time between working on data science projects and teaching Spark classes, which he feels is the best way to remain at the forefront of the technology and capture how people are attempting to use Spark within their businesses.

Presentations

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML Tutorial

Stephane Rion introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine-learning library, using text mining on real-world data as the primary end-to-end use case.

Presentations

The IoT is driving the need for more secure big data analytics Session

Brendan Rizzo explains how data encryption and tokenization can help you protect your Hadoop environment and outlines options for securing data and speeding Hadoop implementation, drawing on recent deployments in pharma, health insurance, retail, and telecoms to illustrate the impact to operations and other areas of the business.

Ian Robinson is a specialist solutions architect for data and analytics at AWS, where he works with customers throughout EMEA, helping them use AWS to create value from the connections in their data. Ian is a coauthor of Graph Databases and REST in Practice (both from O’Reilly) and a contributor to REST: From Research to Practice (Springer) and Service Design Patterns (Addison-Wesley).

Presentations

Building your first big data application on AWS Tutorial

Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ian Meyers, Pratim Das, and Ian Robinson walk you through building a big data application in real time using a combination of open source technologies, including Apache Hadoop, Spark, and Zeppelin, as well as AWS managed services such as Amazon EMR, Amazon Kinesis, and more.

Matthew Rocklin is an open source software developer focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.

Presentations

Dask: Flexible analytic computing for Python Session

Dask parallelizes Python libraries like NumPy, pandas, and scikit-learn, bringing a popular data science stack to the world of distributed computing. Matthew Rocklin discusses the architecture of Dask and its current applications in the wild.

Meet the Expert with Matthew Rocklin (Continuum) Meet the Experts

Matthew will explain how to parallelize Python data science workflows built on NumPy, pandas, and scikit-learn across a cluster with Dask or other parallel computing tools.

Irene Ros is the director of data visualization at Bocoup and the program chair of OpenVis Conf, a two-day conference on data visualization on the open web. Irene is an information visualization researcher and developer, making engaging, informative, and interactive data-driven stories, experiences, and exploratory interfaces on the web. Previously, she was a research developer at IBM Research’s Visual Communication Lab. Irene holds a BS in computer science from the University of Massachusetts Amherst.

Presentations

Visualizing the health of the internet with Measurement Lab Session

Measurement Lab is the largest collection of open internet performance data on the planet, with over five petabytes of information about the quality of experience on the internet and more data generated every day. Irene Ros shares recent work to develop a data processing pipeline, API, and visualizations to make the data more accessible.

Duncan Ross is data and analytics director at TES Global. Duncan has been a data miner since the mid-1990s. Previously at Teradata, Duncan created analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing and social network analysis in telecommunications. In his spare time, Duncan has been a city councilor, chair of a national charity, founder of an award-winning farmers’ market, and one of the founding directors of the Institute of Data Miners. More recently, he cofounded DataKind UK and regularly speaks on data science and social good.

Presentations

How do you help charities do data? Session

Since its creation, DataKind has helped charities do some fantastic things with data science through volunteers from the data science community (that's you!). But charities often don't know what to do next. Duncan Ross and Emma Prest share lessons learned from DataKind's projects and outline a data maturity model for doing good with data.

Barak Rozenwax is a machine-learning product owner and CSPO in Intel’s Advanced Analytics department, where he is part of a team that develops a deep learning training tool that enables easy creation and training of deep learning models. Barak’s previous roles included several product and system positions within his department at Intel. Barak has more than seven years of experience in software and systems engineering. He holds a BSc in industrial engineering and management with a focus on information systems from Ben-Gurion University of the Negev.

Presentations

Faster deep learning solutions from training to inference Session

Barak Rozenwax and Nir Lotan explain how to easily train and deploy deep learning models for image and text analysis problems using Intel's Deep Learning SDK, which lets you use deep learning frameworks, including Caffe and TensorFlow, that have been optimized to run fast on standard CPUs.

Neelesh Srinivas Salian is a software engineer on the Data Platform Infrastructure team in the Algorithms group at Stitch Fix, where he works closely with the Apache Spark ecosystem. Previously, he worked at Cloudera on Apache projects such as YARN, Spark, and Kafka. He holds a master’s degree in computer science with a focus on cloud computing from North Carolina State University and a bachelor’s degree in computer engineering from the University of Mumbai, India.

Presentations

How to secure Apache Spark? Session

Security has been a large and growing aspect of distributed systems, specifically in the big data ecosystem, but it's an underappreciated topic within the Spark framework itself. Neelesh Srinivas Salian explains how detailed knowledge of security setup, along with an awareness of the problems and issues to watch for, can help an organization move forward in the right way.

Mathew Salvaris is a data scientist at Microsoft. Previously, Mathew was a data scientist for a small startup that provided analytics for fund managers and a postdoctoral researcher at UCL’s Institute of Cognitive Neuroscience, where he worked with Patrick Haggard in the area of volition and free will, devising models to decode human decisions in real time from the motor cortex using electroencephalography (EEG). He also held a postdoctoral position at the University of Essex’s Brain Computer Interface group, where he worked on BCIs for computer mouse control. Mathew holds a PhD in brain computer interfaces and an MSc in distributed artificial intelligence.

Presentations

Speeding up machine-learning applications with the LightGBM library in real-time domains HDS

The speed of a machine-learning algorithm can be crucial in problems that require retraining in real time. Mathew Salvaris and Miguel González-Fierro introduce Microsoft's recently open sourced LightGBM library for decision trees, which outperforms other libraries in both speed and predictive performance, and demo several applications using LightGBM.
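For readers new to the technique, the core idea LightGBM implements is gradient boosting over decision trees. The toy sketch below shows that idea with depth-1 stumps and squared loss in pure Python; it is not LightGBM's API, and LightGBM's actual speed comes from histogram-based splits and leaf-wise tree growth, neither of which is shown. The data is invented for the example.

```python
# Toy gradient boosting: each round fits a stump to the current residuals.
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def boost(xs, ys, rounds=50, lr=0.3):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = boost(xs, ys)
print(round(model(2), 1), round(model(11), 1))
```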

Majken Sander is a data nerd, business analyst, and solution architect at TimeXtender. Majken has worked with IT, management information, analytics, BI, and DW for 20+ years. Armed with strong analytical expertise, she is keen on “data driven” as a business principle, data science, the IoT, and all other things data.

Presentations

Discover the business value of open data Tutorial

Majken Sander explains how to create a hub and start exploiting open data. Majken discusses which data can be found from external sources and how open data can add value by enhancing existing company data to gain new insights. There is a dataset out there for your business to become even more data driven. Join Majken to find it.

Kaz Sato is a staff developer advocate on the Cloud Platform team at Google, where he leads the developer advocacy team for machine-learning and data analytics products such as TensorFlow, the Vision API, and BigQuery. Kaz has been leading and supporting developer communities for Google Cloud for over seven years, is a frequent speaker at conferences, including Google I/O 2016, Hadoop Summit 2016 San Jose, Strata + Hadoop World 2016, and Google Next 2015 NYC and Tel Aviv, and has hosted FPGA meetups since 2013.

Presentations

TensorFlow in the wild; Or, the democratization of machine intelligence Session

TensorFlow is democratizing the world of machine intelligence. With TensorFlow (and Google's Cloud Machine Learning platform), anyone can leverage deep learning technology cheaply and without much expertise. Kazunori Sato explores how a cucumber farmer, a car auction service, and a global insurance company adopted TensorFlow and Cloud ML to solve their real-world problems.

Andrei Savu is a software engineer at Cloudera, where he’s working on Cloudera Director, a product that makes Hadoop deployments in cloud environments easier and more reliable for customers.

Presentations

How to optimally run Cloudera batch data engineering workflows in AWS Session

Cloudera Enterprise has made many focused optimizations in order to leverage all of the cloud-native capabilities of AWS for the CDH platform. Andrei Savu and Philip Langdale take you through the ins and outs of successfully running end-to-end batch data engineering workflows in AWS and demonstrate a Cloudera on AWS data engineering workflow with a sample use case.

Dominik Schniertshauer is a data scientist at the global headquarters of the BMW Group in Munich, Germany. As a deep learning enthusiast, Dominik dedicates himself to solving complex problems in the context of customer, logistics, and production data, focusing on the end-to-end character of deep learning use cases and their scalable implementation.

Presentations

Applying machine and deep learning to unleash value in the automotive industry Session

Data-driven solutions based on machine and deep learning are gaining momentum in the automotive industry beyond autonomous driving. Josef Viehhauser and Dominik Schniertshauer explore use cases from the BMW Group where novel machine-learning pipelines (such as those based on XGBoost and convolutional neural nets) support a broad variety of business stakeholders.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface, walking you through building machine-learning algorithms piece by piece and using the higher-level abstractions provided by TensorFlow. You'll then use this knowledge to build machine-learning models on real-world data.
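Trainings like this typically build the training loop up piece by piece before handing it to a framework's higher-level abstractions. The sketch below shows that loop (gradient descent on mean squared error for a linear model) in pure Python, with no TensorFlow required; the data, which follows y = 3x + 2 exactly, is invented for the example.

```python
# From-scratch gradient descent for a one-feature linear model.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]   # exactly y = 3x + 2

w, b, lr = 0.0, 0.0, 0.05
n = len(xs)
for _ in range(2000):
    # Mean-squared-error gradients, derived by hand:
    #   dL/dw = 2/N * sum((w*x + b - y) * x)
    #   dL/db = 2/N * sum(w*x + b - y)
    grad_w = 2 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = 2 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))
```

In TensorFlow the gradients would come from automatic differentiation rather than being derived by hand, which is exactly the abstraction step such a course walks through.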

Machine learning with TensorFlow (Day 2) Training Day 2

Robert Schroll and Patrick Smith demonstrate TensorFlow's capabilities through its Python interface, walking you through building machine-learning algorithms piece by piece and using the higher-level abstractions provided by TensorFlow. You'll then use this knowledge to build machine-learning models on real-world data.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for six years.

Presentations

Cloudy with a chance of on-prem Data 101

The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company or for your use cases is critical to ensuring success. Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations.

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures (O’Reilly).

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover, Ted Malaska, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Mark Sellors is head of data engineering for Mango Solutions, where he helps clients run their data science operations in production-class environments. Mark has extensive experience in analytic computing and helping organizations in sectors from government to pharma to telecoms get the most from their data engineering environments.

Presentations

Spark and R with sparklyr Tutorial

R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Aimee Gott, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.

Robin Senge is a senior big data scientist on an analytics team at inovex GmbH, where he applies machine learning to optimize supply chain processes for one of the biggest groups of retailers in Germany. Robin holds an MSc in computer science and a PhD from the University of Marburg, where his research at the Computational Intelligence Lab focused on machine learning and fuzzy systems.

Presentations

Reliable prediction: Handling uncertainty HDS

Reliable prediction is the ability of a predictive model to explicitly measure the uncertainty involved in a prediction without feedback. Robin Senge shares two approaches to measure different types of uncertainty involved in a prediction.

Manuel Sevilla is a vice president and enterprise architect for Capgemini, where he leads the cloud market for financial services in Europe. Manuel monitors the cloud, big data, and analytics market and works closely with vendors, open source players, and startups to understand market trends, reliability, and maturity, and he advises Capgemini customers on their strategic investments in this area.

Presentations

Executive Briefing: Cloud strategy Session

Manuel Sevilla shares real-world examples to illustrate the rules you need to keep in mind when designing your own cloud strategy.

Yahav Shadmi is a senior data scientist in Intel’s Advanced Analytics department, a group that provides solutions for diverse company challenges using machine learning and big data techniques, where he leads data science projects and solves data-driven problems in the CPU design and architecture domain. Yahav’s current research is on optimization acceleration of deep learning tasks. He holds an MSc in computer science and machine learning from the University of Haifa, Israel.

Presentations

Reducing neural-network training time through hyperparameter optimization Session

Neural-network models have a set of configuration hyperparameters tuned to optimize a given model's accuracy. Yahav Shadmi demonstrates how to select hyperparameters to significantly reduce training time while maintaining accuracy, presents examples for popular neural network models used for text and images, and describes a real-world optimization method for tuning.
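As background for the session's topic, the simplest baseline for hyperparameter search is random sampling over the configuration space. The sketch below is a minimal random-search loop; the "training" is a stand-in objective function invented for the example (lower is better), not a real neural network, and the session's actual optimization method is more sophisticated.

```python
# Minimal random search over two hyperparameters.
import random

def mock_training_loss(learning_rate, batch_size):
    # Stand-in for a real training run: pretend the best settings
    # are learning_rate=0.1 and batch_size=64.
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 64) ** 2

random.seed(0)
best = None
for _ in range(200):
    trial = {
        "learning_rate": 10 ** random.uniform(-4, 0),  # log-uniform sampling
        "batch_size": random.choice([16, 32, 64, 128, 256]),
    }
    loss = mock_training_loss(**trial)
    if best is None or loss < best[0]:
        best = (loss, trial)

print(best[1])
```

Sampling the learning rate log-uniformly is a common choice because its useful values span several orders of magnitude.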

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions. With previous experience in technology leadership positions for NetApp, Fujitsu, and others, Ben’s expertise ranges from development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization, and storage. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes, and he holds two patents.

Presentations

Building a modern data architecture for scale Session

When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among other technologies, and at KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Presentations

Ask me anything: Unraveling data with Spark using machine learning AMA

Join Vartika Singh, Jayant Shekhar, and Jeffrey Shmain to ask questions about their tutorial, Unraveling data with Spark using machine learning, or anything else Spark related.

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches using the machine-learning algorithms available in the Spark framework (and more) to understand and decipher meaningful patterns in real-world data.

Tomer Shiran is cofounder and CEO of Dremio. Previously, Tomer was the VP of product at MapR, where he was responsible for product strategy, roadmap, and new feature development. As a member of the executive team, he helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He is the author of five US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, the Israel Institute of Technology.

Presentations

Creating a virtual data lake with Apache Arrow Session

In most organizations, data is spread across multiple data sources, such as Hadoop/cloud storage, RDBMS, and NoSQL. Tomer Shiran and Jacques Nadeau offer an overview of Apache Arrow, an open source in-memory columnar technology that enables users to combine multiple data sources and expose them as a virtual data lake to users of Spark, SQL-on-Hadoop, Python, and R.

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Ask me anything: Unraveling data with Spark using machine learning AMA

Join Vartika Singh, Jayant Shekhar, and Jeffrey Shmain to ask questions about their tutorial, Unraveling data with Spark using machine learning, or anything else Spark related.

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches using the machine-learning algorithms available in the Spark framework (and more) to understand and decipher meaningful patterns in real-world data.

Tanvi Singh is the chief analytics officer, CCRO, at Credit Suisse, where she leads a global team of 15+ data scientists and analytics SMEs in Zurich, New York, London, and Singapore that is responsible for delivering multimillion-dollar big data projects with leading Silicon Valley vendors in the space of regulatory technology (regtech). Tanvi has 18 years of experience managing big data analytics, SAP business intelligence, data warehousing, digital analytics, and Siebel CRM platforms, with a focus on statistics, machine learning, text mining, and visualizations. She also has experience in quality as a Lean Six Sigma Black Belt. Tanvi holds a master’s degree in software systems from the University of Zurich.

Presentations

Surveillance and monitoring FinData

Regtech is one of the fastest-growing areas in the financial world. Tanvi Singh showcases the use of data science tools and techniques in this space and offers a holistic view of how to do surveillance and monitoring using a man + machine approach.

Vartika Singh is a solutions architect at Cloudera with over 10 years of experience applying machine-learning techniques to big data problems.

Presentations

Ask me anything: Unraveling data with Spark using machine learning AMA

Join Vartika Singh, Jayant Shekhar, and Jeffrey Shmain to ask questions about their tutorial, Unraveling data with Spark using machine learning, or anything else Spark related.

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches using the machine-learning algorithms available in the Spark framework (and more) to understand and decipher meaningful patterns in real-world data.

Vikas Singh is a software engineer at Cloudera.

Presentations

Big data governance for the hybrid cloud: Best practices and how-to Session

Big data needs governance—not just for compliance but also for data scientists. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start, especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Vikas Singh share a step-by-step approach to kickstart your big data governance.

Adam Smith is the chief operating officer at Automated Insights, where he is responsible for all areas of Automated Insights business, including the Wordsmith platform, new products, and professional service implementations. In addition to running company operations, he leads business development, partnerships, and marketing for AI. Previously, Adam was an SVP at Square 1 Bank, where he launched a national division focused on pre-VC startups and advised hundreds of early-stage founders on strategy, sales, and fundraising, and was a VP at the CED, where he worked with the Kauffman Foundation and led development of FastTrac Tech, a world-renowned mentoring program for technology entrepreneurs.

Presentations

The future of natural language generation, 2016–2026 Session

Natural language generation, the branch of AI that turns raw data into human-sounding narratives, is coming into its own in 2016. Adam Smith explores the real-world advances in NLG over the past decade and then looks ahead to the next. Computers are already writing finance, sports, ecommerce, and business intelligence stories. Find out what—and how—they’ll be writing by 2026.

Tom Smith is the managing director of the data science campus at the Office for National Statistics.

Presentations

Enabling data science in the enterprise Keynote

Mike Olson and Tom Smith explain how the Office for National Statistics (ONS), the UK's largest independent producer of official statistics, is leveraging data science to create repeatable, accurate, and transferable statistical research while decreasing time in developing models with better visibility and results.

As chief data architect at Uber, M. C. Srivas worries about all data issues from trips, riders and partners, and pricing to analytics, self-driving cars, security, and data-center planning. Previously, M. C. was CTO and founder of MapR Technologies, a top Hadoop distribution; worked on search at Google, developing and running the core search engine that powered many of Google’s special verticals like ads, maps, and shopping; was chief architect at Spinnaker Networks (now Netapp), which formed the basis of Netapp’s flagship NAS products; and ran the Andrew File System team at Transarc, which was acquired by IBM. M. C. holds an MS from the University of Delaware and a BTech from IIT-Delhi.

Presentations

Real-time intelligence gives Uber the edge Keynote

M. C. Srivas covers the technologies underpinning the big data architecture at Uber and explores some of the real-time problems Uber needs to solve to make ride sharing as smooth and ubiquitous as running water, explaining how they are related to real-time big data analytics.

Tristan Stevens is a senior solutions architect at Cloudera, where he helps clients across EMEA with their Hadoop implementations. Tristan’s background is in the UK defence sector. He has also worked on large-scale, highly available, business-critical analytics platforms, with more recent experience in gaming, telecoms, and financial services.

Presentations

Near-real-time ingest with Apache Flume and Apache Kafka at 1 million-events-per-second scale Session

Vodafone UK’s new SIEM system relies on Apache Flume and Apache Kafka to ingest over 1 million events per second. Tristan Stevens discusses the architecture, deployment, and performance-tuning techniques that enable the system to perform at IoT-scale on modest hardware and at a very low cost.

Scott Stevenson is a data engineer at ASI Data Science, a London AI startup providing bespoke data science consultancy services, where he specialises in building scalable tools and infrastructure for data analysis and machine learning. Scott holds a PhD in particle physics from the University of Oxford, and before joining ASI analysed multi-terabyte datasets collected with the Large Hadron Collider at CERN.

Presentations

Ask me anything: Data science applications and deployment AMA

Angie Ma and Scott Stevenson share their experience and lessons learned from having worked on over 160 commercial data science projects with over 120 organizations from different sectors and industries.

Ben Stopford is an engineer and architect working on the Apache Kafka Core Team at Confluent (the company behind Apache Kafka). A specialist in data, both from a technology and an organizational perspective, Ben previously spent five years leading data integration at a large investment bank, using a central streaming database. His earlier career spanned a variety of projects at Thoughtworks and UK-based enterprise companies. He writes at Benstopford.com.

Presentations

Elastic streams: Dynamic data redistribution in Apache Kafka Session

Dynamic data rebalancing is a complex process. Ben Stopford and Ismael Juma explain how to do data rebalancing and use replication quotas in the latest version of Apache Kafka.

Darren Strange is the big data go-to-market lead for the Google Cloud Platform. Darren has worked in business development on the public cloud since its earliest days, spanning both Google Cloud Platform and competitive platforms, and has helped hundreds of companies across a range of industries move to the cloud and, in particular, to take advantage of advances in big data and machine learning.

Presentations

Architecting the future: Insights learned from Google’s journey in data Session

Darren Strange explores Google's lifelong mission to organize the world's information and make it universally accessible and useful and shares lessons learned along the way. Darren explains how Google grew from thinking of itself as a data company to being a machine-learning company and offers a glimpse of the company's future.

Machine learning is a moonshot for us all (sponsored by Google) Keynote

Data analytics and machine learning are the drivers of the fourth industrial revolution. As technologists, we stand on the brink of incredible opportunity. Darren Strange explores the tremendous opportunity we have before us and asks, will we be pioneers creating new possibilities or will we hold on to the past?

Raffael Strassnig is vice president and data scientist at Barclays, where he pushes the boundaries of predictive systems. Previously, Raffael worked on problems in dynamic advertising at Amazon and real-time analytics at Microsoft. In his free time, he enjoys solving maths riddles, programming in Scala, and cooking. He studied software engineering at the University of Technology in Graz, mathematics at the University of Vienna, and computational intelligence at the University of Technology in Vienna.

Presentations

Making recommendations using graphs and Spark Session

Harry Powell and Raffael Strassnig demonstrate how to model unobserved customer preferences over businesses by thinking about transactional data as a bipartite graph and then computing a new similarity metric—the expected degrees of separation—to characterize the full graph.
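The bipartite modeling idea behind this session can be sketched in a few lines. The snippet below is illustrative only: the speakers' "expected degrees of separation" metric is their own, so a simple Jaccard overlap of customer sets is used here as a stand-in, and the transaction edges are invented for the example.

```python
# Transactions as edges of a bipartite customer-business graph.
transactions = [
    ("alice", "cafe"), ("alice", "bakery"),
    ("bob", "cafe"), ("bob", "bakery"), ("bob", "gym"),
    ("carol", "gym"),
]

# Index one side of the graph: business -> set of its customers.
customers_of = {}
for customer, business in transactions:
    customers_of.setdefault(business, set()).add(customer)

def similarity(b1, b2):
    # Jaccard overlap of the two businesses' customer sets
    # (a stand-in for the talk's expected-degrees-of-separation metric).
    a, b = customers_of[b1], customers_of[b2]
    return len(a & b) / len(a | b)

print(similarity("cafe", "bakery"))  # cafe and bakery share both customers
```

At production scale this pairwise computation is what a Spark job would distribute across the full graph.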

Bargava Subramanian is an India-based data scientist at Cisco Systems. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

Interactive data visualizations using Visdown Tutorial

Crafting interactive data visualizations for the web is hard—you're stuck using proprietary tools or must become proficient in JavaScript libraries like D3. But what if creating a visualization was as easy as writing text? Amit Kapoor and Bargava Subramanian outline the grammar of interactive graphics and explain how to use the declarative markdown-based tool Visdown to build them with ease.

Meet the Expert with Amit Kapoor (narrativeVIZ Consulting) and Bargava Subramanian (Red Hat) Meet the Experts

Stop by and talk with Amit and Bargava about any aspect of narrative and interactive visualization or model (ML) visualization.

Sriskandarajah “Suho” Suhothayan is an associate director and architect at WSO2, where he focuses on real-time and big data technologies and provides technology consulting on customer engagements such as quick-start programs and architecture reviews. Suho is also a visiting lecturer at Robert Gordon University’s IIT Campus in Sri Lanka, where he teaches big data programming. He drives the design and development of Siddhi-CEP—a high-performance complex event processing engine that emerged from his academic studies—and has published several papers on real-time complex event processing systems. He holds a BSc in engineering from the University of Moratuwa, Sri Lanka, where he specialized in computer science and engineering.

Presentations

Transport for London: Using data to keep London moving Tutorial

Transport for London (TfL) and WSO2 have been working together on broader integration projects focused on getting the most efficient use out of London transport. Roland Major and Sriskandarajah Suhothayan explain how TfL and WSO2 bring together a wide range of data from multiple disconnected systems to understand current and predicted transport network status.

David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Meet the Expert with David Talby (Atigeo) Meet the Experts

Tell David about your projects. He can answer any questions you have on how to build large-scale data science platforms and machine learning and natural language understanding pipelines—and how to successfully deploy and operate them in production.

When models go rogue: Hard-earned lessons about using machine learning in production Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Eric Tilenius is CEO of BlueTalon, the leader in secure enterprise data integration across silos. Previously, Eric was an executive in residence at Scale Venture Partners, a general manager at Zynga, and CEO of two venture-backed startups—Netcentives (which he cofounded) and Answers.com, both of which had successful IPOs. He also held product management leadership positions at Oracle Corporation and Intuit and was a consultant with Bain & Company. Eric holds an MBA from Stanford University’s Graduate School of Business, where he was an Arjay Miller Scholar, and a bachelor’s degree in economics (summa cum laude) from Princeton University.

Presentations

EU GDPR as an opportunity to address both big data security and compliance Session

Many businesses will have to address EU GDPR as they deploy big data projects. This is an opportunity to rethink data security and deploy a flexible policy framework adapted to big data and regulations. Eric Tilenius explains how consistent visibility and control at a granular level across data domains can address both security and GDPR compliance.

David Tishgart is director of cloud product marketing at Cloudera. Prior to joining Cloudera, David ran product and partner marketing programs at Gazzang, helping drive business demand for enterprise encryption and key management for big data. Before Gazzang, he was director of services marketing at Dell. David holds a bachelor’s degree in broadcast journalism from The University of Texas at Austin.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Eugene Fratkin, Philip Langdale, David Tishgart, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Fergal Toomey is a specialist in network data analytics and a founder of Corvil, where he has been intensively involved in developing key product innovations directly applicable to managing IT system performance. Fergal has been involved in the design and development of innovative measurement and analysis algorithms for the past 12 years. Previously, he was an assistant professor at the Dublin Institute for Advanced Studies, where he was a member of the Applied Probability Group, which also included Raymond Russell, Corvil’s CTO. Fergal holds an MSc in physics and a PhD in applied probability theory, both from Trinity College, Dublin.

Presentations

Safeguarding electronic stock trading: Challenges and key lessons in network security Session

Fergal Toomey and Graham Ahearne outline the challenges facing network security in complex industries, sharing key lessons learned from their experiences safeguarding electronic trading environments to demonstrate the utility of machine learning and machine-time network data analytics.

Zoltan Toth is a senior Spark instructor on the Databricks training team and a visiting professor at the Central European University. A data engineer and trainer with over 17 years of experience building data-intensive applications, Zoltan also architects and prototypes big data architectures and regularly gives talks at meetups and conferences. Previously, he built Prezi’s data infrastructure and, as the company grew, transformed it into a Hadoop-based big data architecture—and managed the team that scaled it, crunching over a petabyte of data. Zoltan also helped kick off RapidMiner’s Apache Spark integration.

Presentations

Spark foundations: Prototyping Spark use cases on Wikipedia datasets 2-Day Training

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Zoltan Toth employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.

Spark foundations: Prototyping Spark use cases on Wikipedia datasets (Day 2) Training Day 2

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Zoltan Toth employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.

Steve Touw is the cofounder and CTO of Immuta. Steve has a long history of designing large-scale geotemporal analytics across the US intelligence community, including some of the very first Hadoop analytics as well as frameworks to manage complex multitenant data policy controls. He and his cofounders at Immuta drew on this real-world experience to build a software product to make data experimentation easier. Previously, Steve was the CTO of 42Six Solutions (acquired by Computer Sciences Corporation), where he led a large big data services engineering team. Steve holds a BS in geography from the University of Maryland.

Presentations

GDPR, data privacy, anonymization, minimization...oh my! Session

The global populace is asking that the IT industry be held responsible for the safeguarding of individual data. Steve Touw examines some of the data privacy regulations that have arisen and covers design strategies to protect personally identifiable data while still enabling analytics.

Herman van Hövell tot Westerflier is a Spark committer working on Spark SQL at Databricks. Previously, Herman was a consultant working for clients in banking, manufacturing, and logistics. His interests include database systems, optimization, and simulation.

Presentations

A behind-the-scenes look into Spark's API and engine evolutions Session

Herman van Hövell tot Westerflier looks back at the history of data systems, from filesystems, databases, and big data systems (e.g., MapReduce) to "small data" systems (e.g., R and Python), covering the pros and cons of each, the abstractions they provide, and the engines underneath. Herman then shares lessons learned from this evolution, explains how Spark is developed, and offers a peek...

A deep dive into Spark SQL's Catalyst optimizer Session

Herman van Hövell tot Westerflier offers a deep dive into Spark SQL's Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how new and upcoming features are implemented using Catalyst.

Vincent Van Steenbergen is a certified Spark consultant and trainer at w00t data, where he helps companies scale big data and machine-learning solutions into production-ready applications and provides Spark training and consulting to a broad range of companies across Europe and the US. Vincent is a coorganizer of the PAPIs.io international conference.

Presentations

Spark machine-learning pipelines: The good, the bad, and the ugly Session

Spark is now the de facto engine for big data processing. Vincent Van Steenbergen walks you through two real-world applications that use Spark to build functional machine-learning pipelines (wine price prediction and malware analysis), discussing the architecture and implementation and sharing the good, the bad, and the ugly experiences he had along the way.

Eduard Vazquez is head of research at Cortexica Vision Systems. His research covers color and perception, segmentation, medical imaging, and object recognition. Eduard holds a PhD in computer vision from Universitat Autonoma de Barcelona, where he was also a lecturer in artificial intelligence and expert systems.

Presentations

Challenges in commercializing deep learning Tutorial

Cortexica had the first commercial implementation of a deep convolutional network on a GPU back in 2010. However, in the real world, running a CNN is not always possible. Eduard Vazquez discusses the current challenges facing commercial applications based on this technology and how some of them can be tackled.

Gil Vernik is a researcher in IBM’s Storage Clouds, Security, and Analytics group, where he works with Apache Spark, Hadoop, object stores, and NoSQL databases. Gil has more than 25 years of experience as a code developer on both the server side and client side and is fluent in Java, Python, Scala, C/C++, and Erlang. He holds a PhD in mathematics from the University of Haifa and held a postdoctoral position in Germany.

Presentations

Hadoop and object stores: Can we do it better? Session

Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.

Josef Viehhauser is a full stack data scientist at the BMW Group, where he leverages machine learning to create data-driven applications and improve established workflows along the company’s value chain. Josef also works on scoping and implementing such use cases in scalable ecosystems primarily via Python. Outside of work, he is interested in technological innovations and soccer.

Presentations

Applying machine and deep learning to unleash value in the automotive industry Session

Data-driven solutions based on machine and deep learning are gaining momentum in the automotive industry beyond autonomous driving. Josef Viehhauser and Dominik Schniertshauer explore use cases from the BMW Group where novel machine-learning pipelines (such as those based on XGBoost and convolutional neural nets) support a broad variety of business stakeholders.

Kai Voigt is a senior instructor for Hadoop classes at Cloudera, delivering training classes for developers and administrators worldwide. Kai held the same role at MySQL, Sun, and Oracle. He has spoken at a number of O’Reilly conferences.

Presentations

Data science at scale: Using Spark and Hadoop 2-Day Training

Learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Using in-class simulations and exercises, Kai Voigt walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.

Data science at scale: Using Spark and Hadoop (Day 2) Training Day 2

Learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Using in-class simulations and exercises, Kai Voigt walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a streaming data platform built on the Lightbend Reactive Platform, Kafka, Spark, Flink, and Mesosphere DC/OS. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He contributes to several open source projects and co-organizes conferences around the world as well as several user groups in Chicago.

Presentations

Just enough Scala for Spark Tutorial

Apache Spark is written in Scala. Hence, many, if not most, data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.

Meet the Expert with Dean Wampler (Lightbend) Meet the Experts

Dean will discuss trends in streaming data (so-called "fast data"), including Spark, Flink, Kafka, and even Scala, and explain what this means for Hadoop and emerging alternatives.

Stream all the things! Session

"Stream" is a buzzword for several things that share the idea of timely handling of never-ending data. Big data architectures are evolving to be stream oriented. Microservice architectures are inherently message driven. Dean Wampler defines "stream" based on characteristics for such systems, using specific tools as examples, and argues that big data and microservices architectures are converging.

Simon Wardley is a researcher for the Leading Edge Forum focused on the intersection of IT strategy and new technologies. Simon is a seasoned executive who has spent the last 15 years defining future IT strategies for companies in the FMCG, retail, and IT industries—from Canon’s early leadership in the cloud-computing space in 2005 to Ubuntu’s recent dominance as the top cloud operating system. As a geneticist with a love of mathematics and a fascination for economics, Simon has always found himself dealing with complex systems, whether in behavioral patterns, the environmental risks of chemical pollution, developing novel computer systems, or managing companies. He is a passionate advocate and researcher in the fields of open source, commoditization, innovation, organizational structure, and cybernetics.

Simon’s most recent published research, “Clash of the Titans: Can China Dethrone Silicon Valley?,” assesses the high-tech challenge from China and what this means to the future of global technology industry competition. His previous research covers topics including the nature of technological and business change over the next 20 years, value chain mapping, strategies for an increasingly open economy, Web 2.0, and a lifecycle approach to cloud computing. Simon is a regular presenter at conferences worldwide and has been voted one of the UK’s top 50 most influential people in IT in Computer Weekly’s 2011 and 2012 polls.

Presentations

Crossing the river by feeling the stones FinData

Simon Wardley examines the issue of situational awareness and explains how it applies to technology. Using examples from government, finance, and defense, Simon explores how you can map your environment, identify opportunities to exploit, and learn to play the game.

Galiya Warrier is a data solution architect at Microsoft, where she helps enterprise customers adopt Microsoft Azure Data technologies, from big data workloads to machine learning and chatbots.

Presentations

Conversation interfaces for data science models Session

Galiya Warrier demonstrates how to apply a conversational interface (in the form of a chatbot) to communicate with an existing data science model.

Charlotte Werger is the ASI education manager at ASI Data Science. A data scientist with a background in econometrics, Charlotte has worked in finance as a quantitative researcher and portfolio manager for BlackRock and Man AHL, using data science to predict movements in stock markets. As an ASI fellow, she worked on predicting staff performance from psychometric test results; she has also worked on energy smart-meter data analysis. Charlotte holds a PhD in economics from the European University Institute and an MPhil from the Toulouse School of Economics.

Presentations

Practical machine learning with Python Tutorial

Charlotte Werger offers a hands-on overview of implementing machine learning with Python, providing practical experience while covering the most commonly used libraries, including NumPy, pandas, and scikit-learn.

Colin White is a managing director and technology fellow within the Engineering division at Goldman Sachs. Colin is global head of the Workflow group within Enterprise Platforms, a core engineering group that builds services and capabilities to decrease time to market and drive the level of standardization across workflow-based applications firm-wide.

Presentations

Software industrialization meets big data at Goldman Sachs FinData

Colin White discusses Goldman Sachs's industrialization program, under which it is digitizing processes, rules, and data in order to decrease cost, reduce time to market, and manage the risk of repetitive business processes. Goldman Sachs is taking models seriously; the data that is generated offers real insights into how to optimize its business.

Jan Willem Gehrels is a senior data scientist at the IBM Data Science Studios in Amsterdam. Jan Willem has extensive industry experience in the telecom, financial, and energy domains. Previously, he was a market analyst at KPN (a telecommunications company in the Netherlands) and a statistical consultant and research manager at the international research agency Millward Brown.

Presentations

The added value of data science Session

Big data is the new oil—an extremely valuable commodity—but how do you transform raw data into actionable insights, recommendations, and potential profits? Jan Willem Gehrels outlines the tangible value of applying advanced (predictive and prescriptive) analytics to business questions across several markets and industries.

Gary Willis is a data scientist at ASI with a diverse background in applying machine-learning techniques to commercial data science problems. Gary holds a PhD in statistical physics; his research looked at Markov chain Monte Carlo simulations of complex systems.

Presentations

What does your postcode say about you? A technique to understand rare events based on demographics Session

Gary Willis offers a technical presentation of a novel algorithm that uses public data and an unsupervised tree-based learning algorithm to help companies leverage locational data they have on their clients. Along the way, Gary also discusses a wide range of further potential applications.

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Before joining Cloudera, Jennifer worked as a product line manager at VMware, working on the vSphere and Photon system management platforms.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Eugene Fratkin, Philip Langdale, David Tishgart, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Kamran Yousaf is a solution architect at Redis Labs, where he specializes in the development of distributed, high-performance, low-latency architectures, working with a wide range of technologies, from rule-based systems and grid and low-latency applications to enterprise file sync and share. Previously, he was vice president of engineering at UK startup SME, a leader in enterprise file sync and share, and worked at GigaSpaces, BEA, and Versata.

Presentations

Real-time machine learning with Redis, Apache Spark, TensorFlow, and more Session

Kamran Yousaf explains how to substantially accelerate and radically simplify common practices in machine learning, such as running a trained model in production, to meet real-time expectations, using Redis modules that natively store and execute common models generated by Spark ML and TensorFlow algorithms.

Víctor Zabalza is a data engineer at ASI Data Science who is interested in building Python tools for data science. He has a background in high-energy astrophysics, with 10 years of research experience that included work on the origin of gamma-ray emission from systems within our galaxy.

Presentations

Automated data exploration: Building efficient analysis pipelines with dask Session

Data exploration usually entails making endless one-use exploratory plots. Víctor Zabalza shares a Python package based on dask execution graphs and interactive visualization in Jupyter widgets built to overcome this drudge work. Víctor offers an overview of the tool and explains how it was built and why it will become essential in the first steps of every data science project.

Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.

Presentations

Making self-service data science a reality Session

Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.

Charles Zedlewski is vice president of product at Cloudera. Previously, Charles held various management roles in product strategy, management, and operations at SAP, where he led the development of half a dozen new and follow-on releases for products that supported some of SAP’s major growth initiatives in GRC, sustainability, and EPM. Many of these products received substantial critical acclaim and collectively generated more than a hundred million dollars in new product revenues. Prior to SAP, Charles held product roles at BEA Systems and venture-backed software startups. Charles holds a bachelor’s degree from Carleton College and an MBA from MIT.

Presentations

Possibilities powered by the cloud Keynote

The cloud is disrupting every segment of our industry. If you don’t have a strategy for it, then you’re missing what might be your best new market opportunity. Tom Reilly and Charles Zedlewski talk about how machine learning and real-time data in the cloud are powering a new wave of possibilities.

Yingsong Zhang is a data scientist at ASI, where she has worked on everything from social media data to special data from clients to build predictive models. Yingsong has published over 10 first-author research papers in top journals and conferences in the field of signal/image processing and has accumulated extensive experience in algorithm design and information representation. She recently completed a three-year postdoc project at Imperial College London developing sampling theory and the application system. Yingsong holds a BA in mathematics, an MSc in artificial intelligence and pattern recognition from one of China’s top universities, and a PhD in signal and image processing from Cambridge University.

Presentations

Gaining additional labels for data: An introduction to using semisupervised learning for real problems Tutorial

Data labels are sometimes insufficient. In such situations, semisupervised learning can be of great practical value. Yingsong Zhang explores illustrative examples of how to come up with creative solutions derived from textbook approaches.