Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.


Vijay Srinivas Agneeswaran is director of technology at SapientNitro. Vijay has spent the last 10 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine-learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor's degree in computer science and engineering from SVCE, Madras University, and an MS (by research) and a PhD from IIT Madras.

Presentations

Big data computations: Comparing Apache HAWQ, Druid, and GPU databases Session

The class of big data computations known as distributed merge trees aggregates user information across multiple data sources in the media domain. Vijay Srinivas Agneeswaran explores prototypes built on top of Apache HAWQ, Druid, and Kinetica, one of the open source GPU databases. Results show that Kinetica on a single G2.8x node outperformed clusters of HAWQ and Druid nodes.

Graham Ahearne is director of product management for security analytics at Corvil, where he is actively building the next generation of accelerated threat detection and investigation, powered by true real-time analysis of network data. A recognized industry expert, Graham has been advising and building information security solutions for Fortune 500 companies for over 15 years. His expertise and experience spans a broad range of information security technology types, with specialist focus on network forensics, security analytics, threat intelligence, managed services, and host-based security controls. Graham is a Certified Information Systems Security Professional (CISSP).

Presentations

Safeguarding electronic stock trading: Challenges and key lessons in network security Session

Fergal Toomey and Graham Ahearne outline the challenges facing network security in complex industries, sharing key lessons learned from their experiences safeguarding electronic trading environments to demonstrate the utility of machine learning and machine-time network data analytics.

Tyler Akidau is a staff software engineer at Google Seattle. He leads technical infrastructure’s internal data processing teams (MillWheel & Flume), is a founding member of the Apache Beam PMC, and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 Dataflow Model paper and the Streaming 101 and Streaming 102 articles on the O’Reilly website. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Realizing the promise of portability with Apache Beam Session

The world of big data involves an ever-changing field of players. Much as SQL is a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. Tyler Akidau explains how this vision has been realized and discusses the challenges that lie ahead.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

What's your data worth? Session

Valuing data can be a headache. The unique properties of data make it difficult to assess its overall value using traditional valuation approaches. John Akred discusses a number of alternative approaches to valuing data within an organization for specific purposes so that you can optimize decisions around its acquisition and management.

Robbie Allen is the founder and CEO of Automated Insights. The company’s Wordsmith NLG platform is revolutionizing the way professionals generate content with data. Wordsmith helps data-driven industries, including financial services, ecommerce, real estate, business intelligence, and media, achieve content scale, efficiency, and personalization for clients including the Associated Press, Allstate, the Orlando Magic, and Yahoo. Robbie drives the company’s strategic vision, oversees engineering and research, and ensures the company continues to be named one of the best places to work in the Raleigh-Durham area, an honor it’s received from the Triangle Business Journal four years in a row. In 2015, Robbie was named the North Carolina Technology Association’s Tech Exec of the Year. Robbie started writing code to automate the writing process while working at Cisco, where he was a distinguished engineer, the company’s top technical position. He has authored or coauthored 10 books about enterprise software and software development and spoken at numerous events including Strata, SXSW, and the MIT Sloan CIO Symposium. Robbie has two engineering master’s degrees from MIT and was recently appointed an adjunct professor at the UNC Kenan-Flagler Business School.

Presentations

The future of natural language generation, 2016–2026 Session

Natural language generation, the branch of AI that turns raw data into human-sounding narratives, is coming into its own in 2016. Robbie Allen explores the real-world advances in NLG over the past decade and then looks ahead to the next. Computers are already writing finance, sports, ecommerce, and business intelligence stories. Find out what—and how—they’ll be writing by 2026.

Mireia Alos Palop is a data scientist at Teradata. Her main interests are deep learning, machine learning, and applied data science on open source technologies such as Spark. Previously, Mireia was a data scientist at KPN, the largest telecom company in the Netherlands. She holds a PhD in applied physics from the Delft University of Technology.

Presentations

Classifying restaurant pictures: An API with Spark and Slider Session

Mireia Alos Palop and Natalino Busa share an implementation for classifying pictures based on Spark and Slider, developed during the 2016 Yelp Restaurant Photo Classification challenge. Spark processes data and trains the ML model, which consists of deep learning and ensemble classification methods, while picture scoring is exposed via an API that is persisted and scaled with Slider.

Antonio Alvarez is the head of data innovation at Isban UK, which aims to spearhead the transformation to a data-driven organization through digital technology. In partnership with the CDO, Antonio is creating a collaborative environment where innovative strategies and propositions around data from all sides of the business can create value for customers more quickly. Adoption has been rapid, and Santander UK is now implementing frameworks for scaling and broadening the impact of data to disrupt the bank from the inside through a guided self-service approach. Antonio has a background in economics and 18 years of experience in financial services across four countries in business, technology, change, and data.

Presentations

Data citizenship: The next stage of data governance Session

Successful organizations are becoming increasingly Agile, and the autonomy and empowerment that Agile brings create new active modes of engagement. Data governance, however, is still very much a centralized task that only CDOs and data owners actively care about. Antonio Alvarez and Lidia Crespo outline a more engaging and active method of data governance: data citizenship.

Anima Anandkumar is a principal scientist at Amazon Web Services. Anima is currently on leave from UC Irvine, where she is an associate professor. Her research interests are in the areas of large-scale machine learning, nonconvex optimization, and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. Previously, she was a postdoctoral researcher at MIT and a visiting researcher at Microsoft Research New England. Anima is the recipient of several awards, including the Alfred. P. Sloan fellowship, the Microsoft faculty fellowship, the Google research award, the ARO and AFOSR Young Investigator awards, the NSF CAREER Award, the Early Career Excellence in Research Award at UCI, the Best Thesis Award from the ACM SIGMETRICS society, the IBM Fran Allen PhD fellowship, and several best paper awards. She has been featured in a number of forums, such as the Quora ML session, Huffington Post, Forbes, and O’Reilly Media. Anima holds a BTech in electrical engineering from IIT Madras and a PhD from Cornell University.

Presentations

Distributed deep learning on AWS using MXNet Tutorial

Deep learning is the state of the art in domains such as computer vision and natural language understanding. MXNet is a highly flexible and developer-friendly deep learning framework. Anima Anandkumar provides hands-on experience on how to use MXNet with preconfigured Deep Learning AMIs and CloudFormation Templates to help speed your development.

Distributed deep learning on AWS using MXNet Session

Anima Anandkumar demonstrates how to use preconfigured Deep Learning AMIs and CloudFormation templates on AWS to help speed up deep learning development and shares use cases in computer vision and natural language processing.

Jesse Anderson is a data engineer, creative engineer, and CEO of Smoking Hand. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in prestigious publications such as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time data engineering in the cloud 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks—both open source and managed cloud services—discusses the leading cloud providers, and explains how to choose the right one for your company.

Real-time data engineering in the cloud (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks—both open source and managed cloud services—discusses the leading cloud providers, and explains how to choose the right one for your company.

Amitai Armon is the chief data scientist for Intel’s Advanced Analytics group, which provides solutions for the company’s challenges in diverse domains ranging from design and manufacturing to sales and marketing, using machine learning and big data techniques. Previously, Amitai was the cofounder and director of research at TaKaDu, a provider of water-network analytics software to detect hidden underground leaks and network inefficiencies. The company received several international awards, including the World Economic Forum Technology Pioneers award. Amitai has about 15 years of experience in performing and leading data science work. He holds a PhD in computer science from the Tel Aviv University in Israel, where he previously completed his BSc (cum laude, at the age of 18).

Presentations

Faster deep learning solutions from training to inference Session

Amitai Armon and Nir Lotan explain how to easily train and deploy deep learning models for image and text analysis problems using Intel's Deep Learning SDK, which enables you to use deep learning frameworks that were optimized to run fast on regular CPUs, including Caffe and TensorFlow.

Reducing neural-network training time through hyper-parameter optimization Tutorial

Neural network models have a set of configuration hyper-parameters that are tuned to optimize the model's accuracy. Amitai Armon demonstrates that hyper-parameters can also be selected so that training time is significantly reduced while accuracy is maintained, shares speedup examples for popular neural network models used for text and images, and describes an optimization method for tuning them.

Carme Artigas Brugal is the founder and CEO of Synergic Partners, a strategic and technological consulting firm specialized in big data and data science, founded in 2006 and acquired by the Telefonica Group in November 2015. Carme is a member of the Innovation Board of CEOE and of the Industry Affiliate Partners at Columbia University's Data Science Institute. She is a sought-after speaker on big data at international forums such as Strata + Hadoop World and teaches in several master's programs on new technologies, big data, and innovation. She was recently named by the American business magazine Insight Success as the only Spanish woman among the 30 most influential women in business. Carme has more than 20 years of expertise in the telecommunications and IT fields and broad experience in executive roles in both private companies and governmental institutions. She holds a master of science in chemical engineering and an MBA from Universitat Ramon Llull in Barcelona, as well as an executive degree in venture capital from the Haas School of Business at the University of California, Berkeley.

Presentations

Executive Briefing: Analytics centers of excellence as a way to accelerate big data adoption by business Session

Big data technology is mature, but its adoption by business is slow, due in part to challenges like a lack of resources or the need for a cultural change. Carme Artigas explains why an analytics center of excellence (ACoE), whether internal or outsourced, is an effective way to accelerate the adoption and shares an approach to implementing an ACoE.

Doug Ashton is a senior data scientist at Mango Solutions, where he provides training and consultancy to a range of industries, from government to telecommunications and web retailers. Doug is a proponent of reproducible research and has spoken on such topics as reproducible environments and data analysis in teams.

Presentations

Spark and R with sparklyr Tutorial

R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Kate Ross-Smith, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.

Sascha Askani is a senior systems engineer at inovex GmbH. Sascha has a strong storage and disaster recovery background and has helped various customers master their digital transformation challenges. He now focuses on solutions for his customers’ big data needs, with an emphasis on distributed storage solutions.

Presentations

Building containerized Spark on a solid foundation with Quobyte and Kubernetes Session

Multiple challenges arise if distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.

Denis Bauer leads the Transformational Bioinformatics team at Australia’s national science agency, the Commonwealth Scientific and Industrial Research Organisation (CSIRO)—the research institution behind fast WiFi, the Hendra virus vaccine, and polymer banknotes. She is also involved in initiatives to bring genomics into medical practice. Denis holds a PhD in bioinformatics with expertise in machine learning and genomics.

Presentations

How Apache Spark and AWS Lambda empower researchers to identify disease-causing mutations and engineer healthier genomes Tutorial

Denis C. Bauer explores how genomic research has leapfrogged to the forefront of big data and cloud solutions, outlines how to deal with “big” (many samples) and “wide” (many features per sample) data and how to keep runtime constant by using instantaneously scalable microservices with AWS Lambda, and contrasts Spark- and Lambda-based parallelization.

Daniel Bäurer is head of operations at inovex GmbH. Daniel has been designing and operating complex systems for over 15 years. He currently focuses on data center automation and Hadoop platforms.

Presentations

Building containerized Spark on a solid foundation with Quobyte and Kubernetes Session

Multiple challenges arise if distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.

Arturo Bayo is team leader and senior data engineer at Synergic Partners, where he specializes in banking and finance projects. He has broad knowledge of database administration (SQL, MongoDB, and Cassandra) and big data (Hadoop, R, Hive, and Spark). Arturo holds a bachelor of science degree in computer engineering from the Universidad Autonoma of Madrid and a bachelor of business administration (BBA) from UNED.

Presentations

The zoo multitenant ecosystem for efficient big data solutions Session

Big data's main challenge is technological evolution. Arturo Bayo shares a solution combining three approaches to deal with the dynamism of technology: multitenant architectures that allow IT resources to be shared cost-efficiently, modular big data components interacting with a Docker-based container platform, and advanced analytics that predict infrastructure behavior to optimize its efficiency.

Hellmar Becker is a solutions engineer at Hortonworks, where he is helping spread the word about what you can do with data in the modern world. Hellmar has worked in a number of positions in big data analytics and digital analytics. Previously, he worked at ING Bank implementing the Datalake Foundation project (based on Hadoop) within client information management.

Presentations

Daddy, what color is that airplane overhead, and where is it going? Session

Hellmar Becker and Jorn Eilander explore real-time collection and predictive analytics of flight radar data with IoT devices, NiFi, HBase, Spark, and Zeppelin.

Alice Bentinck MBE is cofounder of Entrepreneur First (EF), which supports the world's most ambitious technologists to build their own startups from scratch. Over the last five years, EF has built more than 100 companies, valued at $500M. Last year, one of its companies, Magic Pony Technology, was sold to Twitter for a reported $150M. Alice also set up Code First: Girls, which has taught 5,000 young women to code, for free, at university.

Presentations

Keynote with Alice Bentinck Keynote

Alice Bentinck MBE, cofounder, Entrepreneur First.

Wojciech Biela is the engineering manager for the Warsaw-based Teradata Center for Hadoop team (within Teradata Labs), which is devoted to open source Presto development. Previously, Wojciech helped build the Polish branch for Hadapt, an SQL-on-Hadoop startup from Boston, which was acquired by Teradata in 2014, and developed projects and led development teams across many industries, from large-scale search, ecommerce, and personal banking to POS systems. Wojciech graduated from the Wrocław University of Technology.

Presentations

Presto: Distributed SQL done faster Session

Wojciech Biela and Łukasz Osipiuk offer an introduction to Presto, an open source distributed analytical SQL engine that enables users to run interactive queries over their datasets stored in various data sources, and explore its applications in various big data problems.

Cihan Biyikoglu is vice president of product management at Redis Labs. A big data enthusiast with over 20 years of experience, Cihan has been a C/C++ developer; director of product management at Couchbase, where he was responsible for the Couchbase Server product; and part of the team that launched the Azure public cloud platform at Microsoft, where he also delivered a number of versions of the SQL Server product suite. He also worked on the Informix and Illustra database products that were later acquired by IBM. Cihan has been awarded a number of patents in the database field. He holds a BS in computer engineering and an MS in database systems.

Presentations

Real-time machine learning with Redis, Apache Spark, TensorFlow, and more Session

Cihan Biyikoglu explains how to substantially accelerate and radically simplify common practices in machine learning, such as running a trained model in production, to meet real-time expectations, using Redis modules that natively store and execute common models generated by Spark ML and TensorFlow algorithms.

Matt Brandwein is director of product management at Cloudera, driving the platform’s experience for data scientists and data engineers. Before that, Matt led Cloudera’s product marketing team, with roles spanning product, solution, and partner marketing. Previously, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in computer science and mathematics from the University of Massachusetts Amherst.

Presentations

Making self-service data science a reality Session

Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.

Mikio Braun is delivery lead for recommendation and search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in the industry.

Presentations

Deep learning in practice Session

Deep learning has become the go-to solution for many application areas, such as image classification or speech processing, but does it work for all application areas? Mikio Braun offers background on deep learning and shares his practical experience working with these exciting technologies.

Kay H. Brodersen is a data scientist at Google, where he works on Bayesian statistical models for causal inference in large-scale randomized experiments and anomaly detection in time series data. Kay studied at Muenster (Germany), Cambridge (UK), and Oxford (UK) and holds a PhD degree from ETH Zurich.

Presentations

Inferring the effect of an event using CausalImpact HDS

Causal relationships empower us to understand the consequences of our actions and decide what to do next. This is why identifying causal effects has been at the heart of data science. Kay Brodersen offers an introduction to CausalImpact, a new analysis library developed at Google for identifying the causal effect of an intervention on a metric over time.

Natalino Busa is the head of data science at Teradata, where he leads the definition, design, and implementation of big, fast data solutions for data-driven applications, such as predictive analytics, personalized marketing, and security event monitoring. Previously, Natalino served as enterprise data architect at ING and as senior researcher at Philips Research Laboratories on the topics of system-on-a-chip architectures, distributed computing, and parallelizing compilers. Natalino is an all-around technology manager, product developer, and innovator with a 15+ year track record in research, development, and management of distributed architectures and scalable services and applications.

Presentations

Classifying restaurant pictures: An API with Spark and Slider Session

Mireia Alos Palop and Natalino Busa share an implementation for classifying pictures based on Spark and Slider, developed during the 2016 Yelp Restaurant Photo Classification challenge. Spark processes data and trains the ML model, which consists of deep learning and ensemble classification methods, while picture scoring is exposed via an API that is persisted and scaled with Slider.

Yishay Carmiel is the head of Spoken Labs, a big data analytics unit that implements bleeding-edge deep learning and machine-learning technologies for speech recognition, computer vision, NLP, and data analysis. Yishay and his team are working on state-of-the-art technologies in artificial intelligence, deep learning, and large-scale data analysis. He has 15 years’ experience as an algorithm scientist and technology leader working on building large-scale machine-learning algorithms and serving as a deep learning expert.

Presentations

Conversation AI: From theory to the great promise Session

For years, people have been talking about the great promise of conversation AI. Recently, deep learning has taken us a few steps further toward achieving tangible goals, making a big impact on technologies like speech recognition and natural language processing. Yishay Carmiel offers an overview of the impact of deep learning, recent breakthroughs, and challenges for the future.

Michelle Casbon is director of data science at Qordoba. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on textual datasets. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her writing has been featured in the AI section of O’Reilly Radar. Michelle holds a master’s degree from the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation.

Presentations

Machine learning to automate localization with Apache Spark and other open source tools Session

Supporting multiple locales involves the maintenance and generation of localized strings. Michelle Casbon explains how machine learning and natural language processing are applied to the underserved domain of localization using primarily open source tools, including Scala, Apache Spark, Apache Cassandra, and Apache Kafka.

Haifeng Chen is a senior software architect at Intel’s Asia Pacific R&D Center. He has more than 12 years’ experience in software design and development, big data, and security, with a particular interest in image processing. Haifeng is the author of ColorStorm, an image browsing, editing, and processing application.

Presentations

Speed up big data encryption in Apache Hadoop and Spark Session

As the processing capability of modern platforms approaches memory speed, securing big data with encryption usually hurts performance. Haifeng Chen shares proven ways to speed up data encryption in Hadoop and Spark, as well as the latest progress in open source, and demystifies using hardware acceleration technology to protect your data.

Rumman Chowdhury is a senior manager and AI lead at Accenture, where she works on cutting-edge applications of artificial intelligence and leads the company’s responsible and ethical AI initiatives. She also serves on the board of directors for three AI startups. Rumman’s passion lies at the intersection of artificial intelligence and humanity. She comes to data science from a quantitative social science background. She has been interviewed by Software Engineering Daily, the PHDivas podcast, German Public Television, and fashion line MM LaFleur. In 2017, she gave talks at the Global Artificial Intelligence Conference, IIA Symposium, ODSC Masterclass, and the Digital Humanities and Digital Journalism conference, among others. Rumman holds two undergraduate degrees from MIT and a master’s degree in quantitative methods of the social sciences from Columbia University. She is near completion of her PhD from the University of California, San Diego.

Presentations

Mister P: Imputing granularity from your data Session

Multilevel regression and poststratification (MRP) is a method of estimating granular results from higher-level analyses. While it is generally used to estimate survey responses at a more granular level, MRP has clear applications in industry-level data science. Rumman Chowdhury reviews the methodology behind MRP and provides a hands-on programming tutorial.
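As a toy illustration of the poststratification step at the heart of MRP (not code from the session; the cell estimates and population shares below are hypothetical), model predictions for each demographic cell are reweighted by that cell's share of the target population:

```python
# Toy sketch of MRP's poststratification step (illustrative only):
# reweight per-cell model estimates by each cell's population share.

cell_estimates = {"18-34": 0.62, "35-54": 0.48, "55+": 0.35}     # hypothetical model outputs
population_shares = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}  # hypothetical census shares

def poststratify(estimates, shares):
    """Population-level estimate: sum over cells of (cell estimate * cell share)."""
    return sum(estimates[cell] * shares[cell] for cell in estimates)

print(round(poststratify(cell_estimates, population_shares), 3))  # → 0.483
```

In full MRP, the per-cell estimates come from a multilevel regression fit on survey data, which stabilizes estimates for sparsely observed cells.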

Ira Cohen is a cofounder of Anodot and its chief data scientist, where he is responsible for developing and inventing its real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Learning the relationships between time series metrics at scale; or, Why you can never find a taxi in the rain HDS

Identifying the relationships between time series metrics lets them be used for predictions, root cause diagnosis, and more. Ira Cohen shares accurate methods that work at large scale (e.g., behavioral pattern similarity clustering algorithms) and strategies for reducing false positives and false negatives, reducing computational resources, and distinguishing correlation from causation.
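As a minimal sketch of the general idea (not Anodot's actual algorithms; the metric names and values are invented), candidate relationships between metrics can be surfaced by thresholding a pairwise similarity measure such as Pearson correlation:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented example series: rain suppresses taxi pickups; page views are unrelated.
metrics = {
    "taxi_pickups": [10, 12, 8, 4, 3, 9],
    "rainfall_mm":  [0, 1, 3, 9, 10, 2],
    "page_views":   [100, 98, 103, 99, 101, 102],
}

# Flag strongly (anti-)correlated pairs as candidate relationships.
names = list(metrics)
related = [(a, b)
           for i, a in enumerate(names)
           for b in names[i + 1:]
           if abs(pearson(metrics[a], metrics[b])) > 0.8]
print(related)  # → [('taxi_pickups', 'rainfall_mm')]
```

Note that a similarity threshold alone cannot separate correlation from causation; that distinction is one of the challenges the session addresses.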

Darren Cook is a director at QQ Trend, a financial data analysis and data products company. Darren has over 20 years of experience as a software developer, data analyst, and technical director and has worked on everything from financial trading systems to NLP, data visualization tools, and PR websites for some of the world’s largest brands. He is skilled in a wide range of computer languages, including R, C++, PHP, JavaScript, and Python. Darren is the author of two books, Data Push Apps with HTML5 SSE and Practical Machine Learning with H2O, both from O’Reilly. The latter can help you take what you learn from this talk and start actually using machine-learning algorithms in your organization.

Presentations

Machine-learning algorithms: What they do and when to use them Tutorial

Darren Cook explores the main types of machine-learning algorithms, describing the kinds of task each is suited to, the explainability, repeatability, scalability, training time, sensitivity to data issues, and downsides of each, and the types of answers you can hope to get from them.

Eddie Copeland is director of government innovation at Nesta, an innovation foundation and think tank, where he leads the organisation's work on government data, behavioural insights, digital public services, and digital democracy. Previously, he was head of technology policy at Policy Exchange, one of the UK's most influential think tanks. He is the author of five reports on government use of technology and data and a book on UK think tanks. Eddie is a regular writer and speaker on how government and public sector organisations can deliver more and better with less through smarter use of technology and data. He blogs at http://eddiecopeland.me and tweets as @EddieACopeland.

Presentations

Keynote with Eddie Copeland Keynote

Eddie Copeland, Director of Government Innovation at Nesta

Lidia Crespo is the chief data steward at Santander UK, where she leads the CDO team that supervises the governance of Santander UK's big data platform. She and her team have been instrumental in the adoption of the technology platform, creating a sense of trust through their deep knowledge of the organization's data. With her experience in complex and challenging international projects and her audit, IT, and data background, Lidia brings a combination of skills that is difficult to find.

Presentations

Data citizenship: The next stage of data governance Session

Successful organizations are becoming increasingly Agile, and the autonomy and empowerment that Agile brings create new, active modes of engagement. Data governance, however, is still very much a centralized task that only CDOs and data owners actively care about. Antonio Alvarez and Lidia Crespo outline a more engaging and active method of data governance: data citizenship.

Andy Crisp leads the EU and Asia data engineering team at Dun & Bradstreet, a remit that ensures he keeps more than an eye on innovation. Starting his career with Dun & Bradstreet in sales was his way into the world of big data, but since joining the content team in 2006, Andy has tirelessly led innovative and creative thinking, particularly in terms of how to build and improve the D&B global data asset. Andy is included in the DataIQ Big Data 100 list of the most influential people in data and in 2015 was shortlisted for the Information Age UK Top 50 data leaders.

Presentations

Artificial intelligence in the enterprise Session

Deep learning has shown significant promise in common knowledge-extraction tasks. However, the reputation of neural networks as black-box learners can slow adoption in enterprise businesses. Martin Goodson gives a tell-all account of an ultimately successful installation of a deep learning system in an enterprise environment.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata + Hadoop World conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Thursday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Shannon Cutt is a development editor in O'Reilly's data practice area.

Presentations

Data 101 welcome Tutorial

Shannon Cutt welcomes you to Data 101.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Piet is involved in and coordinates research on the use of secondary data—such as administrative data, internet data, and other big data sources—for official statistics.

Presentations

Relevancer: Finding and labeling relevant information in tweet collections Session

Identifying relevant tweets in collections gathered via keywords is a huge challenge, and it becomes exponentially harder as the ambiguity of the key terms and the size of the collection increase. We introduce our study on using unsupervised and supervised machine learning with linguistic insight to help people identify the tweets relevant to their needs.

Ivan Luciano Danesi is a data scientist at UniCredit Business Integrated Solutions. Ivan is also a teaching assistant for the Department of Statistics at Università Cattolica del Sacro Cuore in Milan. He holds a PhD in statistics from the University of Padua and completed research at Università di Trieste and Cass Business School in London.

Presentations

A value-retention customer relationship management case study for banking FinData

Big data for retail is a step-change innovation path, improving analytics and real-time capabilities to enable more effective and efficient business services, processes, and products. Fabio Oberto and Ivan Luciano Danesi explore a value-retention use case and outline a CRM solution that manages massive data to support the business and create value through machine learning and predictive analytics.

Olivier de Garrigues is an EMEA solutions lead at Trifacta. Olivier has seven years’ experience in analytics. Previously, he was technical lead for business analytics at Splunk and a quantitative analyst at Accenture and Aon.

Presentations

Data wrangling for insurance FinData

Drawing on use cases from Trifacta customers, Olivier de Garrigues explains how to leverage data wrangling solutions in the insurance industry to streamline, strengthen, and improve data analytics initiatives on Hadoop.

Yves-Alexandre de Montjoye is a lecturer at Imperial College London, a research scientist at the MIT Media Lab, and a postdoctoral researcher at Harvard IQSS. His research aims to understand how the unicity of human behavior impacts the privacy of individuals—through re-identification or inference—in large-scale metadata datasets such as mobile phone, credit cards, or browsing data. Previously, he was a researcher at the Santa Fe Institute in New Mexico, worked for the Boston Consulting Group, and acted as an expert for both the Bill and Melinda Gates Foundation and the United Nations. Yves-Alexandre was recently named an innovator under 35 for Belgium. His research has been published in Science and Nature Scientific Reports and has been covered by the BBC, CNN, the New York Times, the Wall Street Journal, Harvard Business Review, Le Monde, Die Spiegel, Die Zeit, and El Pais as well as in his TEDx talks. His work on the shortcomings of anonymization has appeared in reports of the World Economic Forum, United Nations, OECD, FTC, and the European Commission. He is a member of the OECD Advisory Group on Health Data Governance. Yves-Alexandre holds a PhD in computational privacy from MIT, an MSc in applied mathematics from Louvain, an MSc (centralien) from École Centrale Paris, an MSc in mathematical engineering from KU Leuven, and a BSc in engineering from Louvain.

Presentations

Computational privacy and the OPAL project: Using big personal data safely Session

Yves-Alexandre de Montjoye shows how metadata can work as a fingerprint, identifying people in a large-scale metadata database even though no “private” information was ever collected, shares a formula that can be used to estimate the privacy of a dataset if you know its spatial and temporal resolution, and offers an overview of OPAL, a project that enables safe big data use using modern CS tools.

Emma Deraze is a data scientist with TES Global and a volunteer at DataKind UK.

Presentations

Open corporate ownership data Session

Emma Deraze explores a collaborative project between DataKind, Global Witness, and OpenCorporates to analyze open UK corporate ownership data and presents findings and insights into the challenges facing open official data, particularly in international settings involving complex corporate networks.

Ding Ding is a software engineer on Intel’s Big Data Technology team, where she works on developing and optimizing distributed machine learning and deep learning algorithms on Apache Spark, focusing particularly on large-scale analytical applications and infrastructure on Spark.

Presentations

Distributed deep learning at scale on Apache Spark with BigDL HDS

Built on Apache Spark, BigDL provides deep learning functionality parity with existing DL frameworks—with better performance. Ding Ding explains how BigDL helps make the big data platform a unified data analytics platform, enabling more accessible deep learning for big data users and data scientists.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Big data governance for the hybrid cloud: Best practices and how-to Session

Big data needs governance, not just for compliance but also for data scientists. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start—especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky shares a step-by-step approach to kick-start your big data governance initiatives.

Ted Dunning has been involved with a number of startups—the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member for the Apache Zookeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data-mining and passionate about open source, he is an active participant of Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

Tensor abuse in the workplace Session

Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).

Yuval Dvir is head of online partnerships at Google Cloud, where he helps organizations change and transform by adopting a Lean, Agile, and modern way of working powered by Google's Cloud Platform and G Suite infrastructure and productivity suite. Previously, he led product strategy and operations across search ads, display, programmatic, YouTube, and shopping. Yuval is a digital transformation executive with 15 years' experience combining deep product knowledge, rich data insights, and strategic operations know-how to lead change, innovation, and growth in global organizations. As Microsoft's global head of business transformation, he rebuilt Skype's data infrastructure and visualization layer, later managed under a newly designed global insights team. The transformation effort created a modern digital ecosystem, hardwiring product, engineering, and business functions to it across all levels and making it the de facto operating model of Skype. As Skype's lead for product strategy, he radically accelerated the shift to mobile and cloud, streamlined the user experience for a similar look and feel across all platforms, drove the migration of hundreds of millions of Skype and MSN Messenger customers onto a single network under the Skype brand, and co-led the merger with Lync to become a unified consumer and enterprise global business. Yuval is an industry speaker, evangelist, and thought leader on developing and leading high-performance teams, divisions, and companies using analytics, culture, and agility as the main pillars. He holds a BSc from the Technion, Israel's Institute of Technology, and an MBA from INSEAD Business School in France and Singapore.

Presentations

A wealth of information leads to a poverty of attention: Why adopting the cloud can help you stay focused on the right things Session

In an era when we are bombarded with data and tasks to finish, our ability to focus our attention becomes critical. When 70% of our code is for DevOps purposes and 90% of our data is dark, the cloud is a welcome, secure, and efficient relief. Yuval Dvir refutes common misconceptions about the cloud and explains why it's not a matter of "if" but "when" you'll move to the cloud.

Jorn Eilander is a Hadoop DevOps engineer at ING. Jorn has extensive experience working with Hadoop in high-risk enterprise environments as a data ingestion expert and Hadoop system engineer. As an IoT and home automation enthusiast, he's worked with several Raspberry Pi and Arduino platforms to gather data for his Hadoop cluster.

Presentations

Daddy, what color is that airplane overhead, and where is it going? Session

Hellmar Becker and Jorn Eilander explore real-time collection and predictive analytics of flight radar data with IoT devices, NiFi, HBase, Spark, and Zeppelin.

Wael Elrifai is an avid technologist and management strategist bridging the divide between IT and business as Pentaho’s EMEA director of enterprise solutions. Wael is a member of the Association for Computing Machinery, the Special Interest Group for Artificial Intelligence, the Royal Economic Society, and Chatham House. He holds graduate degrees in both electrical engineering and economics.

Presentations

Big data science, the IoT, and the transportation sector Tutorial

Wael Elrifai leads a journey through the design and implementation of a predictive maintenance platform for Hitachi Rail. The industrial internet, the IoT, data science, and big data make for an exciting ride.

Chris Fregly is a research scientist at PipelineIO, a San Francisco-based streaming machine learning and artificial intelligence startup. Previously, Chris was a distributed systems engineer at Netflix, a data solutions engineer at Databricks, and a founding member of the IBM Spark Technology Center in San Francisco. Chris is a regular speaker at conferences and meetups throughout the world. He’s also an Apache Spark contributor, a Netflix Open Source committer, founder of the Global Advanced Spark and TensorFlow meetup, author of the upcoming book Advanced Spark, and creator of the upcoming O’Reilly video series Deploying and Scaling Distributed TensorFlow in Production.

Presentations

Deploy Spark ML TensorFlow AI models from notebooks to hybrid clouds (including GPUs) Session

Chris Fregly explores an often-overlooked area of machine learning and artificial intelligence—the real-time, end-user-facing “serving” layer in hybrid-cloud and on-premises deployment environments—and shares a production-ready environment to serve your notebook-based Spark ML and TensorFlow AI models with highly scalable and highly available robustness.

Ellen Friedman is a solutions consultant, scientist, and O’Reilly author currently writing about a variety of open source and big data topics. Ellen is a committer on the Apache Drill and Mahout projects. With a PhD in biochemistry and years of work writing on a variety of scientific and computing topics, she is an experienced communicator. Ellen is coauthor of Streaming Architecture, the Practical Machine Learning series from O’Reilly, Time Series Databases, and her newest title, Introduction to Apache Flink. She’s also coauthor of a book of magic-themed cartoons, A Rabbit Under the Hat. Ellen has been an invited speaker at Strata + Hadoop in London, Berlin Buzzwords, the University of Sheffield Methods Institute, and the Philly ETE conference and a keynote speaker for NoSQL Matters 2014 in Barcelona.

Presentations

Making a change: Digital transformation and organizational culture Data 101

Big data and emerging technologies offer powerful benefits, but for an organization to use them to their full advantage, a change in organizational culture is required. Ellen Friedman offers practical guidance on how to adopt an organizational culture that supports digital transformation, using examples from a variety of business use cases.

Maosong Fu is the technical lead for Heron and real-time analytics at Twitter. He is the author of several publications in the distributed systems area and holds a master's degree from Carnegie Mellon University and a bachelor's degree from Huazhong University of Science and Technology.

Presentations

Speeding up Twitter Heron streaming by 5x Session

Twitter processes billions of events per day at the instant the data is generated. To achieve real-time performance, Twitter employs Heron, an open source streaming engine tailored for large-scale environments. Karthik Ramasamy and Maosong Fu share several optimizations implemented in Heron to improve throughput by 5x and reduce latency by 50–60%.

Barbara Fusinska is a data solution architect at Microsoft. Barbara has strong software development experience, having built diverse software systems while working with a variety of companies, and she still enjoys programming. She believes in the importance of data and metrics when growing a successful business. Barbara has spoken at a number of conferences; you can read her thoughts on Twitter as @BasiaFusinska and on her blog.

Presentations

Deep learning with Microsoft Cognitive Toolkit Session

The popularity of deep learning is due in part to its capabilities in recognizing patterns from inputs such as images or sounds. Barbara Fusinska offers an overview of Microsoft Cognitive Toolkit, an open source framework offering various modules and algorithms that enable machines to learn like a human brain.

Eddie Garcia is chief security architect at Cloudera, a leader in enterprise analytic data management, where he draws on more than 20 years of information and data security experience to help Cloudera's enterprise customers reduce the security and compliance risks associated with sensitive datasets stored and accessed in Apache Hadoop environments. Previously, Eddie was vice president of infosec and engineering at Gazzang prior to its acquisition by Cloudera, where he architected and implemented secure and compliant big data infrastructures for customers in the financial services, healthcare, and public sector industries to meet PCI, HIPAA, FERPA, FISMA, and EU data security requirements. He was also the chief architect of the Gazzang zNcrypt product and is the author of two patents for data security.

Presentations

Machine learning to "spot" cybersecurity incidents at scale Session

The use of big data and machine learning to detect and predict security threats is a growing trend, with interest from financial institutions, telecommunications providers, healthcare companies, and governments alike. Eddie Garcia explores how companies are using Apache Hadoop-based approaches to protect their organizations and explains how Apache Spot is tackling this challenge head-on.

Bas Geerdink is a programmer, scientist, and IT manager at ING, where he is responsible for the fast data systems that process and analyze streaming data. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Fast data at ING: Utilizing Kafka, Spark, Flink, and Cassandra for data science and streaming analytics Session

As a data-driven enterprise, ING is heavily investing in big data, analytics, and streaming processing. Bas Geerdink shares three use cases at ING and discusses their respective architectures and technology. All software is currently in production, running with modern tools such as Kafka, Cassandra, Spark, Flink, and H2O.ai.

Aurélien Géron is a machine-learning consultant. Previously, he led YouTube’s video classification team and was founder and CTO of two successful companies (a telco operator and a strategy firm). Aurélien is the author of several technical books, including the O’Reilly book Hands-on Machine Learning with Scikit-Learn and TensorFlow.

Presentations

How knowledge graphs can help dramatically improve recommendations Session

Collaborative filtering is great for recommendations, yet it suffers from the cold-start problem: new content with no views is ignored, and new users get poor recommendations. Aurélien Géron shares a solution: knowledge graphs. With a knowledge graph, you can truly understand your users' interests and make better, more relevant recommendations.

Colin Gillespie is a senior lecturer at Newcastle University, UK, where he works on high-performance statistical computing and Bayesian statistics. Colin is also lead consultant at Jumping Rivers. He has been teaching R since 2005 at a variety of levels, ranging from beginner to advanced programming. Colin is the author of the upcoming O'Reilly book Efficient R Programming.

Presentations

Efficient R programming Session

R has the reputation for being slow. Colin Gillespie covers key ideas and techniques for making your R code as efficient as possible, from R setup to common R coding problems to linking R with C++ for an extra speed boost.

Anthony Goldbloom is cofounder and CEO of Kaggle. In 2011 and 2012, Forbes named Anthony one of the 30 under 30 in technology; in 2013, the MIT Technology Review named him one of the top 35 innovators under the age of 35, and the University of Melbourne awarded him an Alumni of Distinction Award. He holds a first-class honors degree in econometrics from the University of Melbourne and has published in The Economist and the Harvard Business Review.

Presentations

Keynote with Anthony Goldbloom Keynote

Anthony Goldbloom, Founder, Kaggle

Miguel González-Fierro is a data scientist at Microsoft UK, where he helps customers improve their processes using big data and machine learning. Previously, he was CEO and founder of Samsamia Technologies, a company that created a visual search engine for fashion items, allowing users to find products using images instead of words, and founder of the Robotics Society of Universidad Carlos III, which developed projects related to UAVs, mobile robots, small humanoid competitions, and 3D printers. Miguel also worked as a robotics scientist at Universidad Carlos III of Madrid and King's College London, where his research focused on learning from demonstration, reinforcement learning, computer vision, and dynamic control of humanoid robots. He holds a BSc and MSc in electrical engineering and an MSc and PhD in robotics.

Presentations

Mastering computer vision problems with state-of-the art deep learning architectures, MXNet, and GPU virtual machines Session

Deep learning is one of the most exciting techniques in machine learning. Miguel González-Fierro explores the problem of image classification using ResNet, the deep neural network that surpassed human-level accuracy for the first time, and demonstrates how to create an end-to-end process to operationalize deep learning in computer vision for business problems using Microsoft RServer and GPU VMs.

Speeding up machine-learning applications with the LightGBM library in real-time domains HDS

The speed of a machine-learning algorithm can be crucial in problems that require retraining in real time. Mathew Salvaris and Miguel González-Fierro introduce Microsoft's recently open sourced LightGBM library for decision trees, which outperforms other libraries in both speed and performance, and demo several applications using LightGBM.

Martin Goodson is the chief scientist and CEO of Evolution AI, where he specializes in large-scale statistical computing and natural language processing. Martin has designed data science products that are in use at companies like Time Inc., Hearst, John Lewis, Condé Nast, and Buzzfeed. Previously, Martin worked as a statistician at the University of Oxford, where he conducted research on statistical matching problems for DNA sequences.

Presentations

10 ways your data project is going to fail and how to prevent it Tutorial

Data science continues to generate excitement, and yet real-world results can often disappoint business stakeholders. Martin Goodson offers a personal perspective on the most common failure modes of data science projects and discusses current best practices.

Artificial intelligence in the enterprise Session

Deep learning has shown significant promise in common knowledge-extraction tasks. However, the reputation of neural networks as black-box learners can slow adoption in enterprise businesses. Martin Goodson gives a tell-all account of an ultimately successful installation of a deep learning system in an enterprise environment.

Martin Görner works in developer relations at Google. Martin is passionate about science, technology, coding, algorithms, and everything in between. Previously, he worked in the computer architecture group of STMicroelectronics and spent 11 years shaping the nascent ebook market, starting at Mobipocket, a startup that later became the software part of the Amazon Kindle and its mobile variants. He graduated from Mines ParisTech.

Presentations

TensorFlow and deep learning (without a PhD) Session

With TensorFlow, deep machine learning has transitioned from an area of research into mainstream software engineering. Martin Görner walks you through building and training a neural network that recognizes handwritten digits with >99% accuracy using Python and TensorFlow.

Trent Gray-Donald is a distinguished engineer in IBM Analytics' Analytic Platform Services organization, where he works on analytics services for IBM Bluemix, including IBM BigInsights and Apache Spark. Previously, Trent worked on high-speed in-memory analytics solutions such as Cognos BI and DB2 BLU. He was a member of the IBM Java Technology Centre and was the overall technical lead on the IBM Java 7 project. Trent holds a bachelor of mathematics in computer science from the University of Waterloo, Canada.

Presentations

Hadoop and object stores: Can we do it better? Session

Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speed up for DFSIO on Hadoop and a 500% speed up for Terasort on Spark.

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop and a committer and PMC member on Apache Sentry and has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on big data topics at national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

What no one tells you about writing a streaming app? Session

Any nontrivial streaming app requires that you consider a number of important topics, but questions like how to manage offsets or state often go unanswered. Mark Grover and Ted Malaska share practices that no one talks about when you start writing a streaming app but that you'll inevitably need to learn along the way.

Luke Han is the cofounder and CEO of Kyligence as well as the cocreator and PMC chair of Apache Kylin, where he drives the project's strategy, roadmap, and product design and works on growing Apache Kylin's community, building its ecosystem, and extending its adoption. Previously, Luke was big data product lead at eBay, where he managed Apache Kylin, engaging customers and coordinating teams across different geographies, and chief consultant at Actuate China.

Presentations

Apache Kylin use cases in China Session

Apache Kylin is rapidly being adopted around the world—especially in China. Luke Han explores how various industries use Apache Kylin, sharing why these companies chose Apache Kylin (a technology comparison), how they use it (their production deployment patterns), and, most importantly, the resulting business impact.

Seth Hendrickson is a data scientist and Scala developer in IBM’s Spark Technology Center. Seth is focused on developing highly parallel machine-learning algorithms for the Apache Spark cluster computing ecosystem.

Presentations

Building a scalable recommendation engine with Spark and Elasticsearch Session

There are many resources available for learning how to use Spark to build collaborative filtering models. However, there are relatively few that explain how to build a large-scale, end-to-end recommender system. Seth Hendrickson demonstrates how to create such a system using Spark Streaming, Spark ML, and Elasticsearch.

Spark Structured Streaming for machine learning Session

Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and currently spends much of his time writing a book, Stream Processing with Apache Flink.

Presentations

Stream analytics with SQL on Apache Flink Session

Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage.

Ali Hürriyetoglu has a background in computer science. He is a PhD candidate at Radboud University and a data scientist at Statistics Netherlands.

Presentations

Relevancer: Finding and labeling relevant information in tweet collections Session

Identifying relevant tweets in tweet collections gathered via keywords is a huge challenge, and it becomes exponentially harder as the ambiguity of the key terms and the size of the collection increase. Ali Hürriyetoglu introduces a study on using unsupervised and supervised machine learning with linguistic insight to enable people to identify the tweets relevant to their needs.

Anand Iyer is a senior product manager at Cloudera, the leading vendor of open source Apache Hadoop. His primary areas of focus are the Apache Spark ecosystem and platforms for real-time stream processing. Previously, Anand worked as an engineer at LinkedIn, where he applied machine-learning techniques to improve the relevance and personalization of LinkedIn’s feed. Anand has extensive experience in leveraging big data platforms to deliver products that delight customers. He has a master’s degree in computer science from Stanford and a bachelor’s degree from the University of Arizona.

Presentations

Spark ML: State of the union and a real-world case study from Kaiser Permanente Tutorial

Sameer Tilak and Anand Iyer offer an overview of recent developments in the Spark ML library and common real-world usage patterns, focusing on how Kaiser Permanente uses Spark ML for predictive analytics on healthcare data. Sameer and Anand share lessons learned building and deploying distributed machine-learning pipelines at Kaiser Permanente.

Matthew Jacobs is a software engineer at Cloudera working on Impala.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Jeroen Janssens is the founder of Data Science Workshops, which provides on-the-job training and coaching in data visualisation, machine learning, and programming. For one day a week, Jeroen is an assistant professor at Jheronimus Academy of Data Science. Previously, he was a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Presentations

Create interactive maps in seconds with R and Leaflet Session

Leaflet is one of the most popular open-source JavaScript libraries for interactive maps. It’s used by websites ranging from The New York Times and The Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB. The Leaflet R package makes it easy to integrate and control Leaflet maps in R.

Rekha Joshi is a principal software engineer in the Intuit Technology Group, where she designs and implements responsive, large-scale, secure platform solutions with technologies such as Apache Kafka, Spark, Cassandra, and Hadoop on Amazon Cloud. Previously, she delivered large-scale distributed solutions at internet scale at Yahoo. She has worked in diverse domains including finance, advertising, supply chain, and AI research. Her refueling stops include reading Isaac Asimov, Richard Feynman, and P.G. Wodehouse.

Presentations

Performance and security: A tale of two cities Session

Performance and security are often at loggerheads. Rekha Joshi explains why and offers a deep dive into how performance and security are managed in some of the most intense and critical data platform services at Intuit.

Ismael Juma is a Kafka committer and engineer at Confluent, where he is building a stream data platform based on Apache Kafka. Earlier, he worked on automated data balancing. Previously, Ismael was the lead architect at Time Out, where he was responsible for the data platform at the core of Time Out’s international expansion and print to digital transition. Ismael has contributed to several open source projects, including Voldemort and Scala.

Presentations

Elastic streams: Dynamic data redistribution in Apache Kafka Session

Dynamic data rebalancing is a complex process. Ben Stopford and Ismael Juma explain how to do data rebalancing and use replication quotas in the latest version of Apache Kafka.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Driving the next wave of data lineage with automation, visualization, and interaction Session

Sean Kandel offers an overview of an entirely new approach to visualizing metadata and data lineage, explaining how to track how different attributes of data are derived during the data preparation process and the associated linkages across different elements in the data.

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. At narrativeVIZ Consulting, Amit uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Amit also teaches storytelling with data as guest faculty in executive courses at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 12 years of management consulting experience with AT Kearney in India, Booz & Company in Europe, and more recently for startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi and a PGDM (MBA) from IIM, Ahmedabad. Find more about him at Amitkaps.com.

Presentations

Interactive data visualizations using Visdown Tutorial

Crafting interactive data visualizations for the web is hard—you're stuck using proprietary tools or must become proficient in JavaScript libraries like D3. But what if creating a visualization was as easy as writing text? Amit Kapoor and Bargava Subramanian outline the grammar of interactive graphics and explain how to use declarative markdown-based tool Visdown to build them with ease.

Reiner Kappenberger is a global product manager at HPE Security–Data Security. Reiner has over 20 years of computer software industry experience focusing on encryption and security for big data environments. His background ranges from device management in the telecommunications sector to GIS and database systems. Reiner holds a diploma in computer science from the Regensburg University of Applied Sciences in Germany.

Presentations

The IoT is driving the need for more secure big data analytics Session

Reiner Kappenberger explains how data encryption and tokenization can help you protect your Hadoop environment and outlines options for securing data and speeding Hadoop implementation, drawing on recent deployments in pharma, health insurance, retail, and telecoms to illustrate the impact to operations and other areas of the business.

Holden Karau is a software development engineer at IBM and is active in open source. Prior to IBM, she worked on a variety of big data, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. Holden is the author of Learning Spark and has assisted with Spark workshops. She graduated from the University of Waterloo with a bachelor of mathematics degree in computer science.

Presentations

Debugging Apache Spark Session

Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau explores how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.

Spark Structured Streaming for machine learning Session

Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.

Mubashir Kazia is a solutions architect at Cloudera focusing on security. Mubashir started the initiative integrating Cloudera Manager with Active Directory for kerberizing the cluster and provided sample code. Mubashir has also contributed patches to Apache Hive that fixed security-related issues.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.

Presentations

Unified stateful big data processing in Apache Beam (incubating) Session

Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Kenneth Knowles introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Presentations

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker offer an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Scott Kurth is the vice president of advisory services at Silicon Valley Data Science, where he helps clients define and execute the strategies and data architectures that enable differentiated business growth. Building on 20 years of experience making emerging technologies relevant to enterprises, he has advised clients on the impact of technological change, typically working with CIOs, CTOs, and heads of business. Scott has helped clients drive global technology strategy, conduct prioritization of technology investments, shape alliance strategy based on technology, and build solutions for their businesses. Previously, Scott was director of the Data Insights R&D practice within the Accenture Technology Labs, where he led a team focused on employing emerging technologies to discover the insight contained in data and bring that insight to bear on business processes to enable new and better outcomes, and even entirely new business models. He also led the creation of Technology Vision, Accenture's annual analysis of emerging technology trends impacting the future of IT, where he was responsible for tracking emerging technologies, analyzing their transformational potential, and using them to influence technology strategy for both Accenture and its clients.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.

Phil Langdale is a lead engineer at Cloudera working on Cloudera Manager. He has worked on all versions of Cloudera Manager since inception.

Presentations

How to optimally run Cloudera batch data analytic workflow in AWS Session

Cloudera Enterprise has made many focused optimizations in order to leverage all of the cloud-native capabilities of AWS for the CDH platform. Andrei Savu and Philip Langdale take you through the ins and outs of successfully running an end-to-end batch data analytic workflow in AWS.

Xueyan Li is a data platform R&D engineer at Qunar, where he is mainly responsible for the continuous integration and development of the Mesos resource management system and the Alluxio distributed memory management system, as well as for the public data services supporting all of Qunar's business lines. His other focuses include the ELK log ETL platform, Spark, Storm, Flink, and Zeppelin. He graduated from Heilongjiang University with a degree in software engineering.

Presentations

How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing Session

Real-time data analysis is becoming more and more important to Internet companies’ daily business. Qunar has been running Alluxio in production for over a year. Xueyan Li explores how stream processing on Alluxio has led to a 16x performance improvement on average and 300x improvement at service peak time on workloads at Qunar.

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker offer an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Nir Lotan is a machine-learning product manager and team manager in Intel’s Advanced Analytics department. Nir’s team develops machine-learning and deep learning-related tools, including a tool that enables easy creation of deep learning models. Prior to this role, Nir held several product, system, and software management positions within Intel’s design center organization and other leading companies. Nir has 15 years of experience in software and systems engineering, products, and management. He holds a BSc degree in computer engineering from the Technion Institute of Technology.

Presentations

Faster deep learning solutions from training to inference Session

Amitai Armon and Nir Lotan explain how to easily train and deploy deep learning models for image and text analysis problems using Intel's Deep Learning SDK, which enables you to use deep learning frameworks that were optimized to run fast on regular CPUs, including Caffe and TensorFlow.

Alison Lowndes is a solutions architect and community manager at NVIDIA. Alison has over 25 years of experience in international project management and entrepreneurship, with two decades spent in the internet arena. In her spare time, she is a founder trustee of a global volunteering network. A recent graduate in artificial intelligence from the University of Leeds, where she completed a thorough empirical study of deep learning, specifically with GPU technology, covering the entire history and technical aspects of GPGPU and its underlying mathematics, Alison combines technical and theoretical computer science with a physics background.

Presentations

Deep learning for object detection and neural network deployment Tutorial

Alison Lowndes leads a hands-on exploration of approaches to the challenging problem of detecting if an object of interest is present within an image and, if so, recognizing its precise location within the image. Along the way, Alison walks you through testing three different approaches to deploying a trained DNN for inference.

Megan Lucero is the director of the Local Data Lab at the Bureau of Investigative Journalism. Previously, she was the data journalism editor at The Times and The Sunday Times. Megan was part of The Times's first data journalism team and led its development from a small supporting unit to a key component of news investigations. She spearheaded The Times's political data unit ahead of the 2015 general election, making it the only one in the industry to reject polling data ahead of the election. Using computational methods, her team brought many issues into the public discourse and won awards for revealing the widespread use of blood doping in the Olympics.

Presentations

Keynote with Megan Lucero Keynote

Megan Lucero, director of the Local Data Lab at the Bureau of Investigative Journalism.

Angie Ma is cofounder and COO of ASI Data Science, a London-based AI startup that offers data science as a service; the company has completed more than 120 commercial data science projects across multiple industries and sectors and is regarded as an EMEA-based leader in data science. Angie is passionate about applying machine learning to real-world problems that generate business value for companies and organizations and has experience delivering complex projects from prototyping to implementation. A physicist by training, Angie was previously a researcher in nanotechnology working on developing optical detection for medical diagnostics.

Presentations

Practical machine learning with Python Tutorial

Angie Ma offers a hands-on overview of implementing machine learning with Python, providing practical experience while covering the most commonly used libraries, including NumPy, pandas, and scikit-learn.

Mark Madsen is a research analyst at Third Nature, where he advises companies on data strategy and technology planning. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide. He focuses on two types of work: the business applications of data and guiding the construction of data infrastructure. As a result, Mark does as much information strategy and IT architecture work as he does performance management and analytics.

Presentations

Executive Briefing: Dealing with device data Session

In 2007, a computer game company decided to jump ahead of competitors by capturing and using data created during online gaming, but it wasn't prepared to deal with the data management and process challenges stemming from distributed devices creating data. Mark Madsen shares a case study that explores the oversights, failures, and lessons the company learned along its journey.

Organizing the data lake Session

Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture.

Roger Magoulas is the research director at O’Reilly Media and chair of the Strata + Hadoop World conferences. Roger and his team build the analysis infrastructure and provide analytic services and insights on technology-adoption trends to business decision makers at O’Reilly and beyond. He and his team find what excites key innovators and use those insights to gather and analyze faint signals from various sources to make sense of what others may adopt and why.

Presentations

Thursday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Roland Major is an enterprise architect at Transport for London, where he works on the Surface Intelligent Transport System, which aims to improve the operation of the road network across London and provide greater insight from existing and new data sources using modern data analytics techniques. Previously, Roland worked on event-driven architectures and solutions in the nuclear, petrochemical, and transport industries.

Presentations

Transport for London: Using data to keep London moving Tutorial

Transport for London (TfL) and WSO2 have been working together on broader integration projects focused on getting the most efficient use out of London transport. Roland Major and Sriskandarajah Suhothayan explain how TfL and WSO2 bring together a wide range of data from multiple disconnected systems to understand current and predicted transport network status.

Ted Malaska is a senior solution architect at Blizzard. Previously, he was a principal solutions architect at Cloudera. Ted has 18 years of professional experience working for startups, the US government, some of the world’s largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the largest nonprofit financial regulator in the US and has worked on close to one hundred clusters for over two dozen clients with hundreds of use cases. He has architecture experience across topics including Hadoop, Web 2.0, mobile, SOA (ESB, BPM), and big data. Ted is a regular contributor to the Hadoop, HBase, and Spark projects, a regular committer to Flume, Avro, Pig, and YARN, and the coauthor of O’Reilly Media’s Hadoop Application Architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

What no one tells you about writing a streaming app? Session

Any nontrivial streaming app requires that you consider a number of important topics, but questions like how to manage offsets or state often go unanswered. Mark Grover and Ted Malaska share practices that no one talks about when you start writing a streaming app but that you'll inevitably need to learn along the way.

James Malone is a product manager for Google Cloud Platform and manages Cloud Dataproc and Apache Beam (incubating). Previously, James worked at Disney and Amazon. James is a big fan of open source software because it shows what is possible when people come together to solve common problems with technology. He also loves data, amateur radio, Disneyland, photography, running, and Legos.

Presentations

Architecting and building enterprise-class Spark and Hadoop in cloud environments Tutorial

James Malone explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.

Nikolay Manchev is a data scientist on IBM’s Big Data technical team. He specializes in machine learning, data science, and big data. He is a speaker, blogger, and the organizer of the London Machine Learning Study Group meetup. Nikolay holds an MSc in software technologies and an MSc in data science, both from City University London.

Presentations

Multi-node restricted Boltzmann machines for big data Session

Nikolay Manchev offers an overview of the restricted Boltzmann machine as a type of neural network with a wide range of applications and shares his experience using it on Hadoop (MapReduce and Spark) to process unstructured and semistructured data at scale.

Leah McGuire is a lead member of technical staff at Salesforce Einstein, where she builds platforms to enable the integration of machine learning into Salesforce products. Before joining Salesforce, Leah was a senior data scientist on the data products team at LinkedIn, working on personalization, entity resolution, and relevance for a variety of LinkedIn data products. She completed a PhD and a postdoctoral fellowship in computational neuroscience at the University of California, San Francisco, and the University of California, Berkeley, where she studied the neural encoding and integration of sensory signals.

Presentations

Meta Data Science: When all the world's data scientists are just not enough Session

What if you had to build more models than there are data scientists in the world? Enterprise companies serving hundreds of thousands of businesses often have to do precisely this. Leah McGuire describes a general-purpose machine learning platform that automatically builds per-company optimized models for any given predictive problem at scale, beating out most hand-tuned models.

Aida Mehonic is an engagement manager at ASI Data Science with a focus on financial services. Previously, she worked in investment banking for four years, most recently as a front office strategist at JPMorgan Investment Bank developing quantitative models and publishing investment research. Aida is a bronze medallist at the International Physics Olympiad. She holds a BA and MMath in mathematics from Cambridge University and a PhD in theoretical physics from UCL. Her research has been published in Nature.

Presentations

Deep learning in commodities markets Tutorial

Aida Mehonic explains how ASI Data Science has trained a deep neural net on historical prices of liquid financial contracts. The neural net has already outperformed comparable strategies based on expert systems.

Is finance ready for AI? FinData

Quantitative finance has been a key feature in the financial industry for over 30 years. Big data, machine learning, and AI are increasingly being used today, but is the financial industry actually ready for AI? Aida Mehonic explores some of the most common applications of AI in finance and shares the typical challenges of data transformation and AI adoption that financial institutions face.

Mostafa Mokhtar is a performance engineer at Cloudera. Previously, he held similar roles at Hortonworks and on the SQL Server team at Microsoft.

Presentations

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Sherry Moore is a software engineer on the Google Brain team. Her other projects at Google include Google Fiber and Google Ads Extractor. Previously, she spent 14 years as a systems and kernel engineer at Sun Microsystems.

Presentations

The state of TensorFlow and where it is going in 2017 Session

Sherry Moore discusses TensorFlow progress and adoption over 2016 and looks ahead to TensorFlow efforts in future areas of importance, such as performance, usability, and ubiquity.

Jonathon is the founder and CEO of New Knowledge, a company building technologies to understand and predict human behavior. As part of his ongoing work applying quantitative methods to combating violent extremism, he has advised the White House and State Department, coauthored the ISIS Twitter Census for the Brookings Institution, and developed new technology with DARPA.

Presentations

Fighting bad guys with data science Session

Jonathon looks at different computer vision, deep learning, and natural language processing techniques for uncovering communities of white nationalists and neo-Nazis on social media and identifying which members are on the path to radicalization.

Alan Mosca is a senior data engineer at Sendence and a part-time doctoral researcher at Birkbeck, University of London, where his research focuses on deep learning ensembles and improvements to optimization algorithms in deep learning. Previously, Alan worked at Wadhwani Asset Management, Jane Street Capital, and several software companies, as well as on several consulting projects in machine learning and deep learning.

Presentations

Ensembles in deep learning with Toupee Tutorial

Alan Mosca discusses using ensembles in deep learning and tackles a benchmark problem in computer vision with Toupee, a library and toolkit for experimentation with deep learning and ensembles.

Barzan Mozafari is an assistant professor of computer science and engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Previously, Barzan was a postdoctoral associate at MIT. His research career has led to many successful open source projects, including CliffGuard (the first robust framework for database tuning), DBSeer (the first automated database diagnosis tool), and BlinkDB (the first massively parallel approximate query engine). Barzan has won the National Science Foundation CAREER award as well as several best paper awards in ACM SIGMOD and EuroSys. He is also a cofounder of DBSeer and a strategic advisor to SnappyData, a company that commercializes the ideas introduced by BlinkDB. Barzan holds a PhD in computer science from UCLA.

Presentations

Verdict: Platform-independent analytics and visualization at subsecond latencies Session

Visualization and exploratory analytics require subsecond interactions with massive volumes of data, a goal that has remained elusive due to numerous inefficiencies across the stack. Barzan Mozafari offers an overview of Verdict, an open source middleware that guarantees subsecond visualization and analytics and works with Impala, Spark, Hive, and most other engines in the Hadoop ecosystem.

Calum Murray is the Small Business group chief data architect at Intuit. Calum has 20 years of experience in software development, primarily in the finance and small business spaces. Over his career, he has worked with various languages, technologies, and topologies to deliver everything from real-time payments platforms to business intelligence platforms.

Presentations

Journey to AWS: Straddling two worlds Session

As Intuit moves its SaaS platform from its own data centers to AWS, it will straddle both worlds for a period of time (and potentially indefinitely). Calum Murray looks at what straddling means to data and data systems.

John Musser is VP of engineering for Basho Technologies, creators of the NoSQL database Riak. John is a recognized industry expert and the founder of ProgrammableWeb, the leading online API resource for developers, and of the DevOps service API Science. He consults on API and big data strategy with clients including Google, Microsoft, AT&T, and Salesforce and has taught at Columbia University and the University of Washington. John is frequently quoted in media outlets, including the Wall Street Journal, the New York Times, Forbes, and Wired, and often speaks at conferences such as OSCON, QCon, SXSW, Dreamforce, and Web 2.0.

Presentations

Lessons learned optimizing NoSQL for Apache Spark Session

How do you take a platform designed for large-scale storage of unstructured key-value data and optimize it for the structured world of Spark? John Musser leads a deep dive into integrating Riak, the distributed key-value NoSQL database, with Spark, covering the challenges and solutions for integrating these tools and sharing lessons learned along the way.

Jacques Nadeau is the cofounder and CTO of Dremio. He is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Previously, Jacques was the architect and engineering manager for Drill and other distributed systems technologies at MapR and the CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).

Presentations

Creating a virtual data lake with Apache Arrow Session

In most organizations, data is spread across multiple data sources, such as Hadoop/cloud storage, RDBMS, and NoSQL. Tomer Shiran and Jacques Nadeau offer an overview of Apache Arrow, an open source in-memory columnar technology that enables users to combine multiple data sources and expose them as a virtual data lake to users of Spark, SQL-on-Hadoop, Python, and R.

Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the Top 30 People in Big Data and Analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

Computable content: Notebooks, containers, and data-centric organizational learning Session

O'Reilly recently launched Oriole, a new learning medium for online tutorials that combines Jupyter notebooks, video timelines, and Docker containers run on a Mesos cluster, based on the pedagogical theory of computable content. Paco Nathan explores the system architecture, shares project experiences, and considers the impact of notebooks for sharing and learning across a data-centric organization.

Allison Nau is head of data solutions at Cox Automotive UK. Allison is a highly driven and self-motivated big data, analytics, and product executive with a proven track record in transforming businesses and driving strategic growth through data analysis and product development. Previously, Allison worked at LexisNexis, where she developed the entire product portfolio of data and analytics products for its expansion into the UK, leading to double-digit growth year on year for that new venture while transforming the motor insurance industry. A trained quantitative political scientist who got her start as a price optimization consultant, Allison holds a BA in mathematics and international relations from the College of Wooster and an MA in political science from the University of Michigan.

Presentations

Big data at Cox Automotive: Delivering actionable insights to transform the way the world buys, sells, and owns vehicles Tutorial

Twenty months into its big data journey, Cox Automotive is using a variety of tools and techniques to deliver actionable insights, transforming decision making within the automotive industry. Allison Nau shares lessons learned, including where to begin transforming a legacy business and industry to become more data driven, how to gain momentum, and how to deliver meaningful results at pace.

Matthias Niehoff is an IT consultant at codecentric AG in Germany, where he focuses on big data and streaming applications with Apache Cassandra and Apache Spark—as well as other tools in the area of big data. Matthias shares his experience at conferences, meetups, and user groups.

Presentations

Lessons learned working with Spark and Cassandra Session

Matthias Niehoff shares lessons learned working with Spark, Cassandra, and the Spark-Cassandra connector and best practices drawn from his work on multiple big and fast data projects, as well as challenges encountered along the way.

Kim Nilsson is the CEO of Pivigo, a London-based data science marketplace and training provider responsible for S2DS, Europe’s largest data science training program, which has by now trained more than 300 fellows working on over 70 commercial projects with 50+ partner companies, including Barclays, KPMG, Royal Mail, and Marks & Spencer. An ex-astronomer turned entrepreneur with a PhD in astrophysics and an MBA, Kim is passionate about people, data, and connecting the two.

Presentations

From data dinosaurs to data stars in five weeks: Lessons from completing 80 data science projects Session

More organizations are becoming aware of the value of data and want to get started and scaled up as quickly as possible. But how? Is it possible to get something useful done in five weeks? Kim Nilsson shares her experiences, both good and bad, delivering over 80 five-week data science projects to over 50 organizations, as well as some concrete tips on how to become a data star organization.

Michael Noll is a product manager at Confluent, the company founded by the creators of Apache Kafka. Previously Michael was the technical lead of DNS operator Verisign’s big data platform, where he grew the Hadoop-, Kafka-, and Storm-based infrastructure from zero to petabyte-sized production clusters spanning multiple data centers—one of the largest big data infrastructures in Europe at the time. He is a well-known tech blogger in the big data community. In his spare time, Michael serves as a technical reviewer for publishers such as Manning and is a frequent speaker at international conferences, including ACM SIGIR, Web Science, and ApacheCon. Michael holds a PhD in computer science.

Presentations

Rethinking stream processing with Apache Kafka: Applications versus clusters and streams versus databases Session

Michael Noll explains how Apache Kafka helps you radically simplify your data processing architectures by building normal applications to serve your real-time processing needs rather than building clusters or similar special-purpose infrastructure—while still benefiting from properties typically associated exclusively with cluster technologies.

Michael Nolting is a data scientist for Volkswagen commercial vehicles. Michael has worked in a variety of research fields at Volkswagen AG, including adapting big data technologies and machine learning algorithms to the automotive context. Previously, he was head of a big data analytics team at Sevenval Technologies. Michael holds a Dipl.-Ing. degree in electrical engineering and an MSc degree in computer science, both from the Technical University of Brunswick in Germany, and a PhD in computer science.

Presentations

How to prevent future accidents in autonomous driving Session

It is nearly impossible to sample enough training data up front to prevent autonomous driving accidents on the road, as Tesla’s Autopilot has sadly proven. Michael Nolting explains that overcoming this problem requires a system that detects dangerous situations in real time, much as website monitoring detects incidents as they occur.

Jack Norris is the senior vice president of data and applications at MapR Technologies. Jack has a wide range of demonstrated successes, from defining new markets for small companies to increasing sales of new products for large public companies, in his 20 years spent in enterprise software marketing. Jack’s broad experience includes launching and establishing analytics, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain & Company. Jack has an MBA from UCLA’s Anderson School of Management and a BA in economics with honors and distinction from Stanford University.

Presentations

Identifying and exploiting the keys to digital transformation Session

Leading companies are integrating operations and analytics to make real-time adjustments to improve revenues, reduce costs, and mitigate risks. There are many aspects to digital transformation, but the timely delivery of actionable data is both a key enabler and an obstacle. Jack Norris explores how companies from Altitude Digital to Uber are transforming their businesses.

Tim O’Reilly has a history of convening conversations that reshape the computer industry. In 1998, he organized the meeting where the term “open source software” was agreed on and helped the business world understand its importance. In 2004, with the Web 2.0 Summit, he defined how “Web 2.0” represented not only the resurgence of the Web after the dot-com bust but a new model for the computer industry, based on big data, collective intelligence, and the Internet as a platform. In 2009, with his “Gov 2.0 Summit,” Tim framed the conversation about the modernization of government technology that has shaped policy and spawned initiatives at the federal, state, and local levels and around the world. He has now turned his attention to implications of the on-demand economy, AI, robotics, and other technologies that are transforming the nature of work and the future shape of the economy. He is exploring these topics at his Next:Economy Summit, taking place in San Francisco this October 10–11. Tim is the founder and CEO of O’Reilly Media and a partner at O’Reilly AlphaTech Ventures (OATV). He sits on the boards of Maker Media (which was spun out from O’Reilly Media in 2012), Code for America, PeerJ, Civis Analytics, and POPVOX.

Presentations

Keynote with Tim O'Reilly Keynote

Tim O'Reilly, Founder and CEO, O'Reilly Media

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and applications solutions. A veteran of WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Fabio Oberto is head of data governance and modeling at UniCredit Business Integrated Solutions, where he supported the initial setup of UniCredit’s big data farm and works on topics such as architectural definition, strategy design, and program management. He is also responsible for data governance, data modeling, and data quality. Fabio holds a bachelor’s degree in computer science from the University of Milan.

Presentations

A value-retention customer relationship management case study for banking FinData

Big data for retail is a step-change innovation path to improving analytics and real-time capabilities to enable more effective and efficient business services, processes, and products. Fabio Oberto and Ivan Luciano Danesi explore a value-retention use case and outline a CRM solution that manages massive data to support business and create value through machine learning and predictive analytics.

Łukasz Osipiuk is a software engineer at the Teradata Center for Hadoop within Teradata Labs, where he is actively engaged in open source Presto development and architecture design. Łukasz was a core member of SQL-on-Hadoop startup Hadapt before its acquisition by Teradata in 2014. Previously, Łukasz was employed at GG Network, where he worked on its large-scale instant messenger core backend and distributed drive storage backend. He graduated from Warsaw University.

Presentations

Presto: Distributed SQL done faster Session

Wojciech Biela and Łukasz Osipiuk offer an introduction to Presto, an open source distributed analytical SQL engine that enables users to run interactive queries over their datasets stored in various data sources, and explore its applications in various big data problems.

Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is the coauthor of Advanced Analytics on Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.

Presentations

What "50 Years of Data Science" leaves out Session

Nobody seems to agree just what data science is. Is it engineering, statistics...both? David Donoho's "50 Years of Data Science" offers a criticism of the hype around data science from a statistics perspective, arguing that it's not a new field. Sean Owen responds with counterpoints from an engineer's perspective, in search of a better understanding of how to teach and practice data science in 2017.

Nicolas Poggi is an IT professional and researcher with a focus on the performance and scalability of web and data-intensive applications. Nicolas leads a new research project on upcoming architectures for the web at the Barcelona Supercomputing Center and Microsoft Research joint center (http://www.bscmsrc.eu/) in Barcelona. He combines a pragmatic approach to performance and scalability drawn from his web industry experience with research in server resource management, such as leveraging machine-learning techniques to optimize performance and profits on the web. A frequent speaker and organizer in the Barcelona web performance community, Nicolas founded the Barcelona Web Performance Group, is organizing the upcoming WebPerfDays.org event in Barcelona, and lectures in master’s classes at UPC. He occasionally blogs about web performance. Nicolas holds a PhD from BarcelonaTech (UPC).

Presentations

The state of Spark in the cloud Session

Nicolas Poggi evaluates the out-of-the-box support for Spark from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, comparing their offerings, reliability, scalability, and price-performance, with an on-premises commodity cluster as the baseline.

Aurélie Pols designs data privacy best practices, documenting data flows in order to limit privacy backlashes and minimizing risk related to ever-increasing data uses while solving for data quality—the most accurate label would probably be "privacy engineer.” She is used to following the money to optimize data trails; now she follows the data to minimize increasing compliance and privacy risks and implement security best practices and ethical data use. Her mantra is: Data is the new oil; Privacy is the new green; Trust is the new currency. Aurélie is the chief visionary officer of Mind Your Group. She has spent the past 15 years optimizing (digital) data-based decision-making processes. She also cofounded and successfully sold a startup to Digitas LBi (Publicis). Aurélie has spoken at various events all over the globe, including SXSW, Strata + Hadoop World, the IAPP’s Data Protection Congress, Webit, and eMetrics summits, and has written several white papers on data privacy and privacy engineering best practices. Aurélie is a member of the European Data Protection Supervisor’s (EDPS) Ethics Advisory Group (EAG), cochairs the IEEE’s P7002—Data Privacy Process standard initiative, and serves as a training advisor to the International Association of Privacy Professionals (IAPP). Previously, she served as data governance and privacy advocate for leading data management platform (DMP) Krux Digital Inc. prior to its acquisition by Salesforce. She teaches privacy and ethics at IE Business School in Madrid and Solvay Business School in Brussels.

Presentations

Data governance and evolving privacy legislation: Daring to move beyond compliance Session

The EU's General Data Protection Regulation is an ambitious legal project to reinstate the rights of "data subjects" within an increasingly lucrative data ecosystem. Aurélie Pols explores the legal obligations on companies and their respective interpretations and looks at how scale and integrity will be safeguarded in the data we increasingly base decisions upon in the long term.

Harry Powell is director and head of advanced data analytics at Barclays.

Presentations

Making recommendations using graphs and Spark Session

Harry Powell and Raffael Strassnig demonstrate how to model unobserved customer preferences over businesses by thinking about transactional data as a bipartite graph and then computing a new similarity metric—the expected degrees of separation—to characterize the full graph.

Emma Prest is the general manager of DataKind UK, where she handles the day-to-day operations supporting the influx of volunteers and building understanding about what data science can do in the charitable sector. Emma sits on the Editorial Advisory Committee at the Bureau of Investigative Journalism. She was previously a program coordinator at Tactical Tech, providing hands-on help for activists using data in evidence-based campaigns. Emma holds an MA in public policy with a specialization in media, information, and communications from Central European University in Hungary and a degree in politics and geography from the University of Edinburgh, Scotland.

Presentations

How do you help charities do data? Session

Since its creation, DataKind has helped charities do some fantastic things with data science through volunteers from the data science community (that's you!). But charities often don't know what to do next. Duncan Ross and Emma Prest share lessons learned from DataKind's projects and outline a data maturity model for doing good with data.

Iñaki Puigdollers is a data scientist at Social Point, where he leads the analytics function for the company's biggest game, Dragon City. Previously, Iñaki worked in data insights at Schibsted Media Group. He holds an MSc in statistical modeling.

Presentations

"Smartifying" the game Session

Low cost, big impact: this is what data science can bring to your business. Iñaki Puigdollers explores how the analytics department changed Social Point games, creating an even better gaming experience and business.

Marco has a background in computer science and cognitive science and holds a PhD in psychophysics. He currently works in the big data group of the methodology department at Statistics Netherlands and is involved in the UNECE big data sandbox project.

Presentations

Relevancer: Finding and labeling relevant information in tweet collections Session

Identifying relevant tweets in collections gathered via keywords is a huge challenge, one that becomes exponentially harder as the ambiguity of the key terms and the size of the collection increase. Marco introduces a study on using unsupervised and supervised machine learning, combined with linguistic insight, to enable people to identify the tweets relevant to their needs.

Giovanni Quattrone is a lecturer in computing science in the Research Group in Applied Software Engineering in the Department of Computer Science at Middlesex University’s School of Science and Technology and an honorary member of the Department of Computer Science at University College London, UK. Previously, Giovanni was a research fellow in the Geospatial Analytics and Computing Group and in the Department of Computer Science at University College London, which he joined thanks to the FP7-PEOPLE-2009-IEF Marie Curie Action.

Presentations

Algorithmic regulation Session

Sharing economy platforms are poorly regulated because there is no evidence upon which to draft policies. Daniele Quercia and Giovanni Quattrone propose a means for gathering evidence by matching web data with official socio-economic data and use data analysis to envision regulations that are responsive to real-time demands, contributing to the emerging idea of algorithmic regulation.

Daniele Quercia is currently building the Social Dynamics group at Bell Labs in Cambridge, UK. Daniele’s research focuses on the area of urban informatics and has received best paper awards from Ubicomp 2014 and ICWSM 2015 as well as an honorable mention from ICWSM 2013. Previously, he was a research scientist at Yahoo Labs, a Horizon senior researcher at the University of Cambridge, and a postdoctoral associate at MIT. Daniele has been named one of Fortune magazine’s 2014 Data All-Stars and has spoken about “happy maps” at TED. He holds a PhD from University College London. His thesis was sponsored by Microsoft Research and was nominated for BCS best British PhD dissertation in computer science.

Presentations

Algorithmic regulation Session

Sharing economy platforms are poorly regulated because there is no evidence upon which to draft policies. Daniele Quercia and Giovanni Quattrone propose a means for gathering evidence by matching web data with official socio-economic data and use data analysis to envision regulations that are responsive to real-time demands, contributing to the emerging idea of algorithmic regulation.

Phillip Radley is chief data architect on BT’s core Enterprise Architecture team, where he is responsible for data architecture across BT Group Plc. Based at BT’s Adastral Park campus in the UK, Phill currently leads BT’s MDM and big data initiatives, driving associated strategic architecture and investment roadmaps for the business. Phill has worked in IT and the communications industry for 30 years, mostly with British Telecommunications Plc., and his previous roles in BT include nine years as chief architect for infrastructure performance-management solutions from UK consumer broadband to outsourced Fortune 500 networks and high-performance trading networks. He has broad global experience, including with BT’s Concert global venture in the US and five years as an Asia Pacific BSS/OSS architect based in Sydney. Phill is a physics graduate with an MBA.

Presentations

Hadoop as a service: How to build and operate an enterprise data lake supporting operational and streaming analytics Session

If you have Hadoop clusters in research or an early-stage data lake and are considering strategic vision and goals, this session is for you. Phillip Radley explains how to run Hadoop as a shared service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.

Karthik Ramasamy is the engineering manager and technical lead for real-time analytics at Twitter. He has two decades of experience working in parallel databases, big data infrastructure, and networking. He cofounded Locomatix, a company specializing in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Before Locomatix, he had a brief stint with Greenplum, where he worked on parallel query scheduling; Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases, and high-availability solutions for network routers that are widely deployed on the internet. Before joining Juniper, Karthik worked extensively at the University of Wisconsin on parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems; several of these research projects were later spun off as a company acquired by Teradata. Karthik is the author of several publications and patents as well as the book Network Routing: Algorithms, Protocols and Architectures. He holds a PhD in computer science from the University of Wisconsin-Madison, with a focus on databases.

Presentations

Speeding up Twitter Heron streaming by 5x Session

Twitter processes billions of events per day at the instant the data is generated. To achieve real-time performance, Twitter employs Heron, an open source streaming engine tailored for large-scale environments. Karthik Ramasamy and Maosong Fu share several optimizations implemented in Heron to improve throughput by 5x and reduce latency by 50–60%.

Miriam Redi is a research scientist on the Social Dynamics team at Bell Labs Cambridge. Her research focuses on content-based social multimedia understanding and culture analytics; in particular, she explores ways to automatically assess visual aesthetics, sentiment, and creativity and to exploit the power of computer vision in the context of the web, social media, and online communities. Miriam earned her PhD in the multimedia group at EURECOM, Sophia Antipolis. After obtaining her PhD, she was a postdoc in the social media group at Yahoo Labs Barcelona and a research scientist at Yahoo London.

Presentations

Keynote with Miriam Redi Keynote

Miriam Redi, Research Scientist in the Social Dynamics team at Bell Labs Cambridge.

Doron Reuter is head of business development for the ING Wholesale Banking Advanced Analytics team, where his goal is to make fellow wholesale bankers at ING aware of the potential of advanced analytics and facilitate their advanced analytics project ideas. Doron has been a corporate and investment banker for 14 years at BNP Paribas in London, Fortis, and ING in the Netherlands. Previously, Doron worked at an internet startup in the ’90s and at Rational Software (now IBM). Doron holds a bachelor of science in computer science and economics and an MBA. Born in South Africa, he now lives in the Netherlands with his wife, Daniela, and his children, Mia and Etan. When family and work allow, Doron is treasurer of a charity, swims, ice skates, gyms, runs, and does pretty much anything else sporty that anyone feels like doing with him.

Presentations

Three years into changing ING Wholesale Banking from a B2B company into a data-driven enterprise FinData

In 2013, ING announced its Think Forward strategy to empower people in life and in business using advanced analytics (AA). To ING Wholesale Banking's corporate and investment bankers, big data, AA, and AI were a "retail thing." Doron Reuter explains how a team of 15 convinced 15,000 bankers that ING should become data driven beyond BI and shares some lessons learned so far on this three-year journey.

Alberto Rey is head of data science at easyJet, where he leads easyJet’s efforts to adopt advanced analytics within different areas of the business. Alberto’s background is in air transport and economics, and he has more than 15 years’ experience in the air travel industry. Alberto started his career in advanced analytics as a member of the Pricing and Revenue Management team at easyJet, working on the development of one of the most advanced pricing engines within the industry, where his team pioneered the implementation of machine-learning techniques to drive pricing. He holds an MSc in data mining and an MBA from Cranfield University.

Presentations

Growing a data-driven organization at easyJet Tutorial

Many large organizations want to develop data science capabilities, but the traditional complexity and legacy of such companies don’t allow a fast and agile evolution toward data-driven decision making. EasyJet is working toward becoming completely data driven. Alberto Rey shares real-world examples on how easyJet is tackling the challenges of scaling up its analytics capabilities.

Matthew Rocklin is an open source software developer focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.

Presentations

Dask: Flexible analytic computing for Python Session

Dask parallelizes Python libraries like NumPy, pandas, and scikit-learn, bringing a popular data science stack to the world of distributed computing. Matthew Rocklin discusses the architecture and current applications of Dask used in the wild.
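To give a flavor of the model the session covers, Dask's `delayed` interface turns ordinary Python functions into lazy tasks; the toy functions below are illustrative only:

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Calling the wrapped functions builds a task graph; nothing executes
# until .compute(), when Dask runs independent tasks (the two inc
# calls here) in parallel.
total = add(inc(1), inc(2))
result = total.compute()
print(result)  # → 5
```

The same graph-building idea underlies Dask's drop-in parallel versions of NumPy arrays and pandas DataFrames.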

Irene Ros is the director of data visualization at Bocoup and the program chair of OpenVis Conf, a two-day conference on data visualization on the open Web. Irene is an information visualization researcher and developer, making engaging, informative, and interactive data-driven stories, experiences, and exploratory interfaces on the web. Previously, she was a research developer at IBM Research’s Visual Communication Lab. Irene holds a BS in computer science from the University of Massachusetts Amherst.

Presentations

Visualizing the health of the internet with Measurement Lab Session

Measurement Lab is the largest collection of open internet performance data on the planet, with over five petabytes of information about the quality of experience on the internet and more data generated every day. Irene Ros shares recent work to develop a data processing pipeline, API, and visualizations to make the data more accessible.

Duncan Ross is data and analytics director at TES Global. Duncan has been a data miner since the mid-1990s. Previously at Teradata, Duncan created analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing and social network analysis in telecommunications. In his spare time, Duncan has been a city councilor, chair of a national charity, founder of an award-winning farmers’ market, and one of the founding directors of the Institute of Data Miners. More recently, he cofounded DataKind UK and regularly speaks on data science and social good.

Presentations

How do you help charities do data? Session

Since its creation, DataKind has helped charities do some fantastic things with data science through volunteers from the data science community (that's you!). But charities often don't know what to do next. Duncan Ross and Emma Prest share lessons learned from DataKind's projects and outline a data maturity model for doing good with data.

Kate Ross-Smith is a data scientist at Mango Solutions who confesses to geeking out over data science’s puzzles, technology, and shiny things. Specializing in helping define and secure data architectures, Kate loves the constant challenge of fitting the right technology together to solve super interesting business issues.

Presentations

Spark and R with sparklyr Tutorial

R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Kate Ross-Smith, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.

Neelesh Srinivas Salian is a software engineer on the Data Platform team at Stitch Fix, where he works closely with the Apache Spark ecosystem as part of the infrastructure group. Previously, he worked at Cloudera on Apache projects like YARN, Spark, and Kafka. He holds a master’s degree in computer science from North Carolina State University, with a focus on cloud computing, and a bachelor’s degree in engineering from the University of Mumbai, India.

Presentations

How to secure Apache Spark? Session

Security is a large and growing concern in distributed systems, particularly in the big data ecosystem, but it remains an underappreciated topic within the Spark framework itself. Neelesh Srinivas Salian explains how detailed knowledge of your setup, along with an awareness of the problems and issues to look out for, can help your organization move forward in the right way.

Mathew Salvaris is a data scientist at Microsoft. Previously, Mathew was a data scientist for a small startup that provided analytics for fund managers; a postdoctoral researcher at UCL’s Institute of Cognitive Neuroscience, where he worked with Patrick Haggard in the area of volition and free will, devising models to decode human decisions in real time from the motor cortex using electroencephalography (EEG); and a postdoctoral researcher in the University of Essex’s brain-computer interface group, where he worked on BCIs for computer mouse control. Mathew holds a PhD in brain-computer interfaces and an MSc in distributed artificial intelligence.

Presentations

Speeding up machine-learning applications with the LightGBM library in real-time domains HDS

The speed of a machine-learning algorithm can be crucial in problems that require retraining in real time. Mathew Salvaris and Miguel González-Fierro introduce Microsoft's recently open sourced LightGBM library for decision trees, which outperforms other libraries in both speed and performance, and demo several applications using LightGBM.

Majken Sander is a data nerd, business analyst, and solution architect at TimeXtender. Majken has worked with IT, management information, analytics, BI, and DW for 20+ years. Armed with strong analytical expertise, she is keen on “data driven” as a business principle, data science, the IoT, and all other things data.

Presentations

Discover the business value of open data Tutorial

Majken Sander explains how to create a hub and start exploiting open data. Majken discusses which data can be found from external sources and how open data can add value by enhancing existing company data to gain new insights. There is a dataset out there for your business to become even more data driven. Join Majken to find it.

Kaz Sato is a staff developer advocate on the Cloud Platform team at Google, where he leads the developer advocacy team for machine-learning and data analytics products such as TensorFlow, the Vision API, and BigQuery. Kaz has been leading and supporting developer communities for Google Cloud for over seven years, is a frequent speaker at conferences, including Google I/O 2016, Hadoop Summit 2016 San Jose, Strata + Hadoop World 2016, and Google Next 2015 NYC and Tel Aviv, and has hosted FPGA meetups since 2013.

Presentations

TensorFlow in the wild; Or, the democratization of machine intelligence Session

TensorFlow is democratizing the world of machine intelligence. With TensorFlow (and Google's Cloud Machine Learning platform), anyone can leverage deep learning technology cheaply and without much expertise. Kazunori Sato explores how a cucumber farmer, a car auction service, and a global insurance company adopted TensorFlow and Cloud ML to solve their real-world problems.

Andrei Savu is a software engineer at Cloudera, where he’s working on Cloudera Director, a product that makes Hadoop deployments in cloud environments easier and more reliable for customers.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

How to optimally run Cloudera batch data analytic workflow in AWS Session

Cloudera Enterprise has made many focused optimizations in order to leverage all of the cloud-native capabilities of AWS for the CDH platform. Andrei Savu and Philip Langdale take you through the ins and outs of successfully running an end-to-end batch data analytic workflow in AWS.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

This training will introduce TensorFlow's capabilities through its Python interface. It will move from building machine learning algorithms piece by piece to using the higher-level abstractions provided by TensorFlow. Students will use this knowledge to build machine-learning models on real-world data.

Machine learning with TensorFlow (Day 2) Training Day 2

This training will introduce TensorFlow's capabilities through its Python interface. It will move from building machine learning algorithms piece by piece to using the higher-level abstractions provided by TensorFlow. Students will use this knowledge to build machine-learning models on real-world data.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies, Inc. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for six years.

Presentations

Cloudy with a chance of on-prem Data 101

The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company or for your use cases is critical to ensuring success. Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations.

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly Media.

Presentations

Architecting a next-generation data platform Tutorial

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Mark Sellors is head of data engineering for Mango Solutions, where he helps clients run their data science operations in production-class environments. Mark has extensive experience in analytic computing and helping organizations in sectors from government to pharma to telecoms get the most from their data engineering environments.

Presentations

Spark and R with sparklyr Tutorial

R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Kate Ross-Smith, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.

Robin Senge is a senior big data scientist on an analytics team at inovex GmbH, where he applies machine learning to optimize supply chain processes for one of the biggest groups of retailers in Germany. Previously, he worked as a software engineer consulting and developing software for financial applications, such as trading and portfolio management systems. Robin holds an MSc in computer science and a PhD from the University of Marburg, where his research at the Computational Intelligence Lab focused on machine learning and fuzzy systems.

Presentations

Reliable prediction: Handling uncertainty HDS

Reliable prediction is the ability of a predictive model to explicitly measure the uncertainty involved in a prediction without feedback. Robin Senge shares two approaches to measure different types of uncertainty involved in a prediction.

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions. With previous experience in technology leadership positions for NetApp, Fujitsu, and others, Ben’s expertise ranges from development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization, and storage. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes, and he holds two patents.

Presentations

Building a modern data architecture for scale Session

When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera, working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies; at eBay, building out a new shopping platform, K2, using Nutch and Hadoop, among other technologies; and at KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Presentations

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches, using the machine-learning algorithms available in the Spark framework (and beyond), to understanding and deciphering meaningful patterns in real-world data in order to derive value.

Tomer Shiran is cofounder and CEO of Dremio. Previously, Tomer was the VP of product at MapR, where he was responsible for product strategy, roadmap, and new feature development. As a member of the executive team, he helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He is the author of five US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, the Israel Institute of Technology.

Presentations

Creating a virtual data lake with Apache Arrow Session

In most organizations, data is spread across multiple data sources, such as Hadoop/cloud storage, RDBMS, and NoSQL. Tomer Shiran and Jacques Nadeau offer an overview of Apache Arrow, an open source in-memory columnar technology that enables users to combine multiple data sources and expose them as a virtual data lake to users of Spark, SQL-on-Hadoop, Python, and R.

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches, using the machine-learning algorithms available in the Spark framework (and beyond), to understanding and deciphering meaningful patterns in real-world data in order to derive value.

Tanvi Singh is the chief analytics officer, CCRO at Credit Suisse, where she leads a global team of 15+ data scientists and analytics SMEs in Zurich, New York, London, and Singapore that is responsible for delivering multimillion-dollar big data projects with leading Silicon Valley vendors in the space of regulatory technology (regtech). Tanvi has 18 years of experience managing big data analytics, SAP business intelligence, data warehousing, digital analytics, and Siebel CRM platforms, with a focus on statistics, machine learning, text mining, and visualizations. She also has experience in quality as a Lean Six Sigma Black Belt. Tanvi holds a master’s degree in software systems from the University of Zurich.

Presentations

Surveillance and monitoring FinData

Regtech is one of the fastest-growing areas in the financial world. Tanvi Singh showcases the use of data science tools and techniques in this space and offers a holistic view of how to do surveillance and monitoring using a man + machine approach.

Vartika Singh is a solutions architect at Cloudera with over 10 years of experience applying machine-learning techniques to big data problems.

Presentations

Unraveling data with Spark using machine learning Tutorial

Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches, using the machine-learning algorithms available in the Spark framework (and beyond), to understanding and deciphering meaningful patterns in real-world data in order to derive value.

Ben Spivey is a principal solutions architect at Cloudera providing consulting services for large financial-services customers. Ben specializes in Hadoop security and operations. He is the coauthor of Hadoop Security from O’Reilly Media (2015).

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

As chief data architect at Uber, M. C. Srivas worries about all data issues from trips, riders and partners, and pricing to analytics, self-driving cars, security, and data-center planning. Previously, M. C. was CTO and founder of MapR Technologies, a top Hadoop distribution; worked on search at Google, developing and running the core search engine that powered many of Google’s special verticals like ads, maps, and shopping; was chief architect at Spinnaker Networks (now Netapp), which formed the basis of Netapp’s flagship NAS products; and ran the Andrew File System team at Transarc, which was acquired by IBM. M. C. holds an MS from the University of Delaware and a BTech from IIT-Delhi.

Presentations

Real-time intelligence gives Uber the edge Keynote

M. C. Srivas covers the technologies underpinning the big data architecture at Uber and explores some of the real-time problems Uber needs to solve to make ride sharing as smooth and ubiquitous as running water, explaining how they are related to real-time big data analytics.

Tristan Stevens is a senior solutions architect at Cloudera, where he helps clients across EMEA with their Hadoop implementations. Tristan’s background is in the UK defence sector. He has also worked on large-scale, highly available, business-critical analytics platforms, with more recent experience in gaming, telecoms, and financial services.

Presentations

Near-real-time ingest with Apache Flume and Apache Kafka at 1 million-events-per-second scale Session

Vodafone UK’s new SIEM system relies on Apache Flume and Apache Kafka to ingest over 1 million events per second. Tristan Stevens discusses the architecture, deployment, and performance-tuning techniques that enable the system to perform at IoT scale on modest hardware and at a very low cost.

Ben Stopford is an engineer and architect working on the Apache Kafka Core Team at Confluent (the company behind Apache Kafka). A specialist in data, both from a technology and an organizational perspective, Ben previously spent five years leading data integration at a large investment bank, using a central streaming database. His earlier career spanned a variety of projects at Thoughtworks and UK-based enterprise companies. He writes at Benstopford.com.

Presentations

Elastic streams: Dynamic data redistribution in Apache Kafka Session

Dynamic data rebalancing is a complex process. Ben Stopford and Ismael Juma explain how to do data rebalancing and use replication quotas in the latest version of Apache Kafka.
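For context, the replication quotas this session covers (added to Apache Kafka in 0.10.1 via KIP-73) are applied with Kafka's stock command-line tools. As a hedged sketch only, throttling replication traffic during a rebalance might look like the following; the broker ID, ZooKeeper address, rates, and file name are illustrative:

```shell
# Throttle replication on broker 0 to ~10 MB/s on both the leader and follower side.
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type brokers --entity-name 0 \
  --add-config 'leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760'

# Alternatively, apply a throttle for the duration of a partition reassignment.
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassignment.json --execute --throttle 10485760

# When the reassignment completes, --verify confirms it and removes the throttle.
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassignment.json --verify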

Raffael Strassnig is vice president and data scientist at Barclays, where he pushes the boundaries of predictive systems. Previously, Raffael worked on problems in dynamic advertising at Amazon and real-time analytics at Microsoft. In his free time, he enjoys solving maths riddles, programming in Scala, and cooking. He studied software engineering at the University of Technology in Graz, mathematics at the University of Vienna, and computational intelligence at the University of Technology in Vienna.

Presentations

Making recommendations using graphs and Spark Session

Harry Powell and Raffael Strassnig demonstrate how to model unobserved customer preferences over businesses by thinking about transactional data as a bipartite graph and then computing a new similarity metric—the expected degrees of separation—to characterize the full graph.

Bargava Subramanian is an India-based data scientist at Cisco Systems. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

Interactive data visualizations using Visdown Tutorial

Crafting interactive data visualizations for the web is hard—you're stuck using proprietary tools or must become proficient in JavaScript libraries like D3. But what if creating a visualization was as easy as writing text? Amit Kapoor and Bargava Subramanian outline the grammar of interactive graphics and explain how to use the declarative markdown-based tool Visdown to build them with ease.
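To illustrate the "visualization as text" idea, a Visdown chart is a fenced block of YAML following the Vega-Lite grammar. This is a hedged sketch: the data URL and field names are hypothetical, and the exact syntax may vary by Visdown version.

```vis
data:
  url: data/cars.csv
mark: point
encoding:
  x:
    field: horsepower
    type: quantitative
  y:
    field: mpg
    type: quantitative
```

Embedded in a markdown page, a block like this renders as an interactive scatterplot with no JavaScript written by the author.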

Sriskandarajah “Suho” Suhothayan is a technical lead at WSO2, where he focuses on real-time and big data technologies and provides technology consulting on customer engagements such as quick start programs and architecture reviews. Suho is also a visiting lecturer at Robert Gordon University’s IIT Campus in Sri Lanka, where he teaches big data programming. He drives the design and development of Siddhi-CEP—a high-performance complex event processing engine that emerged from his academic studies—and has published several papers on real-time complex event processing systems. He holds a BSc in engineering from the University of Moratuwa, Sri Lanka, where he specialized in computer science and engineering.

Presentations

Transport for London: Using data to keep London moving Tutorial

Transport for London (TfL) and WSO2 have been working together on broader integration projects focused on getting the most efficient use out of London transport. Roland Major and Sriskandarajah Suhothayan explain how TfL and WSO2 bring together a wide range of data from multiple disconnected systems to understand current and predicted transport network status.

David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science along with master’s degrees in both computer science and business administration.

Presentations

When models go rogue: Hard-earned lessons about using machine learning in production Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Sameer Tilak is a data scientist with the Medical Informatics group at Kaiser Permanente, where he leads the big data initiative. Previously, Sameer was an associate research scientist at the University of California, San Diego (UCSD), where he was PI and co-PI on several projects funded by the NSF and private foundations. He has extensive experience with sensor networks and big data technologies and their use in building real-world, large-scale applications. Sameer holds a PhD in computer science from SUNY Binghamton and an MS in computer science from the University of Rochester and SUNY Binghamton.

Presentations

Spark ML: State of the union and a real-world case study from Kaiser Permanente Tutorial

Sameer Tilak and Anand Iyer offer an overview of recent developments in the Spark ML library and common real-world usage patterns, focusing on how Kaiser Permanente uses Spark ML for predictive analytics on healthcare data. Sameer and Anand share lessons learned building and deploying distributed machine-learning pipelines at Kaiser Permanente.

Eric Tilenius is CEO of BlueTalon, the leader in secure enterprise data integration across silos. Previously, Eric was an executive in residence at Scale Venture Partners, a general manager at Zynga, and CEO of two venture-backed startups—Netcentives (which he cofounded) and Answers.com—both of which had successful IPOs. He also held product management leadership positions at Oracle Corporation and Intuit and was a consultant with Bain & Company. Eric holds an MBA from Stanford University’s Graduate School of Business, where he was an Arjay Miller Scholar, and a bachelor’s degree in economics (summa cum laude) from Princeton University.

Presentations

EU GDPR as an opportunity to address both big data security and compliance Session

Many businesses will have to address EU GDPR as they deploy big data projects. This is an opportunity to rethink data security and deploy a flexible policy framework adapted to big data and regulations. Eric Tilenius explains how consistent visibility and control at a granular level across data domains can address both security and GDPR compliance.

Fergal Toomey is a specialist in network data analytics and a founder of Corvil, where he has been intensively involved in developing key product innovations directly applicable to managing IT system performance. Fergal has been involved in the design and development of innovative measurement and analysis algorithms for the past 12 years. Previously, he was an assistant professor at the Dublin Institute for Advanced Studies, where he was a member of the Applied Probability Group, which also included Raymond Russell, Corvil’s CTO. Fergal holds an MSc in physics and a PhD in applied probability theory, both from Trinity College, Dublin.

Presentations

Safeguarding electronic stock trading: Challenges and key lessons in network security Session

Fergal Toomey and Graham Ahearne outline the challenges facing network security in complex industries, sharing key lessons learned from their experiences safeguarding electronic trading environments to demonstrate the utility of machine learning and machine-time network data analytics.

Steve Touw is the cofounder and CTO of Immuta. Steve has a long history of designing large-scale geotemporal analytics across the US intelligence community, including some of the very first Hadoop analytics as well as frameworks to manage complex multitenant data policy controls. He and his cofounders at Immuta drew on this real-world experience to build a software product to make data experimentation easier. Previously, Steve was the CTO of 42Six Solutions (acquired by Computer Sciences Corporation), where he led a large big data services engineering team. Steve holds a BS in geography from the University of Maryland.

Presentations

GDPR, data privacy, anonymization, minimization…oh my! Session

The global populace is asking for the IT industry to be held responsible for the safeguarding of individual data. Steve Touw examines some of the data privacy regulations that have arisen and covers design strategies to protect personally identifiable data while still enabling analytics.

Herman van Hövell tot Westerflier is a Spark committer working on Spark SQL at Databricks. Previously, Herman was a consultant working for clients in banking, manufacturing, and logistics. His interests include database systems, optimization, and simulation.

Presentations

A deep dive into Spark SQL's Catalyst optimizer Session

Herman van Hövell tot Westerflier offers a deep dive into Spark SQL's Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how new and upcoming features are implemented using Catalyst.

Vincent Van Steenbergen is a certified Spark consultant and trainer at w00t data, where he helps companies scale big data and machine-learning solutions into production-ready applications and provides Spark training and consulting to a broad range of companies across Europe and the US. Vincent is a coorganizer of the PAPIs.io international conference.

Presentations

Spark machine-learning pipelines: The good, the bad, and the ugly Session

Spark is now the de facto engine for big data processing. Vincent Van Steenbergen walks you through two real-world applications that use Spark to build functional machine-learning pipelines (wine price prediction and malware analysis), discussing the architecture and implementation and sharing the good, the bad, and the ugly experiences he had along the way.

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Vinithra was previously a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Eduard Vazquez is head of research at Cortexica Vision Systems. His main research topics cover the study of color and perception, segmentation, medical imaging, and object recognition. Eduard holds a PhD in computer vision from Universitat Autonoma de Barcelona, where he was also a lecturer in artificial intelligence and expert systems.

Presentations

Challenges in commercializing deep learning Tutorial

Cortexica had the first commercial implementation of a deep convolutional network on a GPU back in 2010. However, in the real world, running a CNN is not always a possibility. Eduard Vazquez discusses the current challenges that commercial applications based on this technology are facing and how some of them can be tackled.

Gil Vernik is a researcher in IBM’s Storage Clouds, Security, and Analytics group, where he works with Apache Spark, Hadoop, object stores, and NoSQL databases. Gil has more than 25 years of experience as a code developer on both the server side and client side and is fluent in Java, Python, Scala, C/C++, and Erlang. He holds a PhD in mathematics from the University of Haifa and held a postdoctoral position in Germany.

Presentations

Hadoop and object stores: Can we do it better? Session

Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.
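For context, wiring a connector like Stocator into Hadoop amounts to registering its filesystem implementation in core-site.xml. This sketch follows Stocator's documented setup for the swift2d:// scheme; property names and schemes vary across Stocator versions and object stores:

```xml
<!-- core-site.xml: register Stocator as the handler for the swift2d:// scheme -->
<property>
  <name>fs.swift2d.impl</name>
  <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>
```

Jobs can then read and write paths like swift2d://container.service/path directly, letting the connector avoid the rename-heavy commit behavior that makes native Hadoop connectors slow against object stores.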

Dean Wampler is the architect for fast data products at Lightbend, where he specializes in scalable, distributed big data and streaming systems using tools like Spark, Mesos, Akka, Cassandra, and Kafka (the SMACK stack). Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly Media. He is a contributor to several open source projects and the co-organizer of several conferences around the world and several user groups in Chicago. Dean can be found on Twitter as @deanwampler.

Presentations

Just enough Scala for Spark Tutorial

Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.

Stream all the things! Session

"Stream" is a buzzword for several things that share the idea of timely handling of neverending data. Big data architectures are evolving to be stream oriented. Microservice architectures are inherently message driven. Dean Wampler defines "stream" based on characteristics for such systems, using specific tools as examples, and argues that big data and microservices architectures are converging.

Galiya Warrier is a data solution architect at Microsoft, where she helps enterprise customers adopt Microsoft Azure data technologies, from big data workloads to machine learning and chatbots.

Presentations

Conversation interfaces for data science models Session

Galiya Warrier demonstrates how to apply a conversational interface (in the form of a chatbot) to communicate with an existing data science model.

Colin White is a managing director and technology fellow within the Engineering division at Goldman Sachs. Colin is global head of the Workflow group within Enterprise Platforms, a core engineering group that builds services and capabilities to decrease time to market and drive the level of standardization across workflow-based applications firm-wide.

Presentations

Software industrialization meets big data at Goldman Sachs FinData

Colin White discusses Goldman Sachs's industrialization program, under which it is digitizing processes, rules, and data in order to decrease cost, reduce time to market, and manage the risk of repetitive business processes. Goldman Sachs is taking models seriously; the data that is generated offers real insights into how to optimize its business.

Edd Wilder-James is a technology analyst, writer, and entrepreneur based in California. He’s helping transform businesses with data as VP of strategy for Silicon Valley Data Science. Formerly Edd Dumbill, Edd was the founding program chair for the O’Reilly Strata conferences and chaired the Open Source Convention for six years. He was also the founding editor of the peer-reviewed journal Big Data. A startup veteran, Edd was the founder and creator of the Expectnation conference-management system and a cofounder of the Pharmalicensing.com online intellectual-property exchange. An advocate and contributor to open source software, Edd has contributed to various projects such as Debian and GNOME and created the DOAP vocabulary for describing software projects. Edd has written four books, including O’Reilly’s Learning Rails.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.

The business case for deep learning, Spark, and friends Data 101

Deep learning is white-hot at the moment, but why does it matter? Developers are usually the first to understand why some technologies cause more excitement than others. Edd Wilder-James relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Gary Willis is a data scientist at ASI with a diverse background in applying machine-learning techniques to commercial data science problems. Gary holds a PhD in statistical physics; his research looked at Markov chain Monte Carlo simulations of complex systems.

Presentations

What does your postcode say about you? A technique to understand rare events based on demographics Session

Gary Willis offers a technical presentation of a novel algorithm that uses public data and an unsupervised tree-based learning algorithm to help companies leverage the locational data they hold on their clients. Along the way, Gary also discusses a wide range of further potential applications.

Ian Wrigley has taught tens of thousands of students over the last 25 years in subjects ranging from C programming to Hadoop development and administration. Ian is currently the director of education services at Confluent, where he heads the team building and delivering courses focused on Apache Kafka and its ecosystem.

Presentations

Real-time data pipelines with Apache Kafka Tutorial

Ian Wrigley demonstrates how to use Kafka Connect and Kafka Streams to build real-world, real-time streaming data pipelines—using Kafka Connect to ingest data from a relational database into Kafka topics as the data is being generated and then using Kafka Streams to process and enrich the data in real time before writing it out for further analysis.

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Before joining Cloudera, Jennifer worked as a product line manager at VMware, working on the vSphere and Photon system management platforms.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Reynold Xin is a cofounder and chief architect at Databricks as well as an Apache Spark PMC member and release manager for Spark’s 2.0 release. Prior to Databricks, Reynold was pursuing a PhD at the UC Berkeley AMPLab, where he worked on large-scale data processing.

Presentations

A behind-the-scenes look into Spark's API and engine evolutions Session

Reynold Xin looks back at the history of data systems, from filesystems, databases, and big data systems (e.g., MapReduce) to "small data" systems (e.g., R and Python), covering the pros and cons of each, the abstractions they provide, and the engines underneath. Reynold then shares lessons learned from this evolution, explains how Spark is developed, and offers a peek into the future of Spark.

Víctor Zabalza is a data engineer at ASI Data Science. He has a background in high-energy astrophysics, with 10 years of research experience that included work on the origin of gamma-ray emission from systems within our galaxy.

Presentations

Automated data exploration: Building efficient analysis pipelines with Dask Session

Data exploration usually entails making endless one-use exploratory plots. Víctor Zabalza shares a Python package, based on Dask execution graphs and interactive visualization in Jupyter widgets, built to overcome this drudge work. Víctor offers an overview of the tool and explains how it was built and why it will become essential in the first steps of every data science project.

Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.

Presentations

Making self-service data science a reality Session

Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality.

Yingsong Zhang is a data scientist at ASI, where she has worked on everything from social media data to special data from clients to build predictive models. Yingsong has published over 10 first-author research papers in top journals and conferences in the field of signal and image processing and has accumulated extensive experience in algorithm design and information representation. She recently completed a three-year postdoctoral project at Imperial College London developing sampling theory and its applications. Yingsong holds a BA in mathematics, an MSc in artificial intelligence and pattern recognition from one of China's top universities, and a PhD in signal and image processing from Cambridge University.

Presentations

Gaining additional labels for data: An introduction to using semisupervised learning for real problems Tutorial

There are occasions when the labels on data are insufficient. In such situations, semisupervised learning can be of great practical value. Yingsong Zhang explores illustrative examples of how to come up with creative solutions derived from textbook approaches.