Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Speakers

Experts and innovators from around the world share their insights and best practices. New speakers are added regularly. Please check back to see the latest updates to the agenda.

Saif Addin Ellafi is a software developer at John Snow Labs, where he’s the main contributor to Spark NLP. A data scientist, forever student, and an extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.

Presentations

Spark NLP in action: Intelligent, high-accuracy fact extraction from long financial documents Session

Spark NLP natively extends Spark ML to provide natural language understanding capabilities with performance and scale that were not previously possible. David Talby, Saif Addin Ellafi, and Paul Parau explain how Spark NLP was used to augment the Recognos smart data extraction platform in order to automatically infer fuzzy, implied, and complex facts from long financial documents.

Mahmood Adil is a medical director at NHS National Services Scotland, where he oversees the health intelligence and health protection functions of Scotland and provides leadership in utilizing the most comprehensive national health data and informatics capability to improve broad-ranging clinical outcomes, from cancer to infection control, through new service models and supporting digital health. He is also honorary professor of health intelligence and service effectiveness at the University of Glasgow. Mahmood has 25 years of medical, public health, and executive management experience. Previously, he was national quality and efficiency advisor at the Department of Health (England), a World Bank advisor, a visiting professor at the Manchester Business School, and a fellow of the NHS Institute of Innovation and Improvement. He is an alumnus of the Harvard Kennedy School, the UK National Civil Service College, the Institute of Healthcare Improvement (USA), and the Yale School of Public Health.

Presentations

Data Collaboratives Session

Jude McCorry and Mahmood Adil offer an overview of Data Collaboratives, a new form of collaboration beyond the public-private partnership model, in which participants from different sectors exchange data, skills, leadership, and knowledge to solve complex problems facing children in Scotland and worldwide.

Nidhi Aggarwal is the coauthor of the Medium publication Radical Product. An entrepreneur who is passionate about building radical products, Nidhi most recently led product, strategy, marketing, and finance at data integration company Tamr. Previously, she cofounded cloud configuration management startup qwikLABS (acquired by Google), which remains the exclusive platform used by AWS customers and partners worldwide to create and deploy on-demand lab environments in the cloud, and worked at McKinsey & Company, where she focused on big data and cloud strategy. She holds six US patents and a PhD in computer science from the University of Wisconsin-Madison.

Presentations

Measure what matters: How your measurement strategy can reduce opex Tutorial

These days it’s easy for companies to say, “We measure everything!” The problem is, most popular metrics may not be appropriate or relevant for your business. Measurement isn’t free and should be done strategically. Radhika Dutt, Geordie Kaytes, and Nidhi Aggarwal explain how to align measurement with your product strategy so you can measure what matters for your business.

Alasdair Allan is a director at Babilim Light Industries and a scientist, author, hacker, maker, and journalist. An expert on the internet of things and sensor systems, he’s famous for hacking hotel radios, deploying mesh networked sensors through the Moscone Center during Google I/O, and for being behind one of the first big mobile privacy scandals when, back in 2011, he revealed that Apple’s iPhone was tracking user location constantly. He’s written eight books and writes regularly for Hackster.io, Hackaday, and other outlets. A former astronomer, he also built a peer-to-peer autonomous telescope network that detected what was, at the time, the most distant object ever discovered.

Presentations

Executive Briefing: Data privacy in the age of the internet of things Session

The increasing ubiquity of the internet of things has put a new focus on data privacy. Big data is all very well when it's harvested quietly and stealthily, but when your things tattle on you behind your back, it's a very different matter altogether. Alasdair Allan explains why the internet of things brings with it a whole new set of big data problems that can't be ignored.

Saeed Amen is the founder of Cuemacro, where he consults and publishes research on systematic trading for clients including major quant funds and data companies such as Bloomberg. He is also a cofounder of the Thalesians. Over the past decade, Saeed Amen has created systematic trading strategies at major investment banks including Lehman Brothers and Nomura. He has developed many popular open source Python libraries including finmarketpy and is the author of Trading Thalesians: What the Ancient World Can Teach Us about Trading Today. Saeed has presented his work at conferences and institutions including the IMF, the Bank of England, and the Federal Reserve Board.

Presentations

Monetizing your data for financial markets Findata

Saeed Amen explores various sources of big data and alternative data for financial markets and details ways corporates can monetize their datasets by distributing them to financial market participants, thus creating new revenue streams and marketing opportunities.

Using Python to analyze financial markets Session

Saeed Amen explores Python libraries that can be used at the various stages of financial analysis, including time series analysis, visualization, structuring data, and storing market data.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He’s taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He’s widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: How do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Real-time systems with Spark Streaming and Kafka (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: How do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Carlo Appugliese is a program director for machine learning engineering at IBM, where he focuses on IBM Analytics, Watson, and Cloud. A technologist with a track record of leveraging emerging technologies and trends to drive business transformation, Carlo has held a number of roles, including computer programmer, manager of application development, and director of innovation.

Presentations

Putting AI to work for business: It's a journey. (sponsored by IBM) Session

What was once science fiction has now become reality as multiple AI consumer-based solutions have hit the market over the last few years. In turn, consumers have become more comfortable interacting with AI. But has AI really lived up to the hype? For consumers, perhaps not yet. However, AI for business is a different (and more valuable) animal. Carlo Appugliese details how business can put AI to work.

André Araujo is a principal solutions architect at Cloudera. An experienced consultant with a deep understanding of the Hadoop stack and its components and a methodical and keen troubleshooter who loves making things run faster, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs.

Presentations

Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Tutorial

Hybrid big data deployments present significant new security risks. Security admins must ensure a consistently secured and governed experience for end users and administrators across multiple workloads. Mark Donsky, Steffen Maerkl, and André Araujo share best practices for meeting these challenges as they walk you through securing a Hadoop cluster.

Brian Arnold is the lead architect for the Data Historian platform at Monsanto, where he is responsible for guiding the technical direction and implementation for the platform. Previously, he assisted in implementing Monsanto’s enterprise Kafka platform. Brian has 10 years of experience as an IT professional, working on a large-scale ecommerce website and implementing various big data applications. Brian is passionate about big data, the cloud, data science, and functional programming and has experience building recommendation system platforms and enterprise data lakes. Brian holds a BS in computer engineering with a minor in mathematics from Marquette University.

Presentations

You call it a data lake; we call it Data Historian. Session

There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security.

Carme Artigas is the founder and CEO of Synergic Partners, a strategic and technological consulting firm specializing in big data and data science (acquired by Telefónica in 2015). She has more than 20 years of extensive expertise in the telecommunications and IT fields and has held several executive roles in both private companies and governmental institutions. Carme is a member of the Innovation Board of CEOE and the Industry Affiliate Partners at Columbia University’s Data Science Institute. An in-demand speaker on big data, she has given talks at several international forums, including Strata Data Conference, and collaborates as a professor in various master’s programs on new technologies, big data, and innovation. Carme was recently recognized as the only Spanish woman among the 30 most influential women in business by Insight Success. She holds an MS in chemical engineering, an MBA from Ramon Llull University in Barcelona, and an executive degree in venture capital from UC Berkeley’s Haas School of Business.

Presentations

Delivering business value to Pepsico through mobility and retail data DCS

The entire retail sector has started to transform its core business in order to meet new consumer demands. Carme Artigas and Nuria Bombardó explain how Synergic Partners has helped Pepsico leverage its data to achieve direct business impact in revenue growth.

David Asboth is a data scientist at Cox Automotive Data Solutions, where he spends his days creating value from messy and incomplete data. Previously, he was a software developer. David holds an MSc in data science.

Presentations

Scaling data science (teams and technologies) Session

Cox Automotive is the world’s largest automotive service organization, which means it can combine data from across the entire vehicle lifecycle. Cox is on a journey to turn this data into insights. David Asboth and Shaun McGirr share their experience building up a data science team at Cox and scaling the company's data science process from laptop to Hadoop cluster.

Harvinder Atwal is head of customer insight and marketing optimization at Moneysupermarket, where he leads a team of data scientists delivering data-driven customer insight, marketing optimization, and predictive analytics to help the company’s 23 million customers. Harvinder is interested in data science, machine learning, big data technologies, and how they can be used to improve customer experience.

Presentations

DataOps: Nine steps to transform your data science impact Session

Harvinder Atwal offers an entertaining and practical introduction to DataOps, a new and independent approach to delivering data science value at scale, and shares experience-based solutions for increasing your velocity of value creation, including Agile prioritization and collaboration, new operational processes for an end-to-end data lifecycle, and more.

Eran Avidan is a senior software engineer in Intel’s Advanced Analytics Department. Eran enjoys everything distributed, from Spark and Kafka to Kubernetes and TensorFlow. He holds an MS in computer science from the Hebrew University of Jerusalem.

Presentations

Real-time deep learning on video streams Session

Deep learning is revolutionizing many domains within computer vision, but doing real-time analysis is challenging. Eran Avidan offers an overview of a novel architecture based on Redis, Docker, and TensorFlow that enables real-time analysis of high-resolution streaming video.

Amr Awadallah is the cofounder and CTO at Cloudera. Previously, Amr was an entrepreneur in residence at Accel Partners, served as vice president of product intelligence engineering at Yahoo, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr’s first startup, VivaSmart, was acquired by Yahoo in July 2000. Amr holds bachelor’s and master’s degrees in electrical engineering from Cairo University, Egypt, and a PhD in electrical engineering from Stanford University.

Presentations

Moving machine learning and analytics to hyperspeed Keynote

Imagine the value you could drive in your business if you could accelerate your journey to machine learning and analytics. Amr Awadallah, Ankit Tharwani, and Bala Chandrasekaran explain how Barclays has driven innovation in real-time analytics and machine learning with Apache Kudu, accelerating the time to value across multiple business initiatives, including marketing, fraud prevention, and more.

Marton Balassi is a solutions architect at Cloudera, where he focuses on data science and stream processing with big data tools. Marton is a PMC member at Apache Flink and a regular contributor to open source. He is a frequent speaker at big data-related conferences and meetups, including Hadoop Summit, Spark Summit, and Apache Big Data.

Presentations

Improving computer vision models at scale Session

Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable.

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Marton Balassi, Steven Totman, and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Louise Beaumont has a number of perspectives on the open future. She is a strategic advisor to the Publicis Groupe and its clients on open banking, imagining the open future that consumers and small businesses deserve. As cochair of techUK’s Open Banking and Payments Working Group, Louise makes the case for financial services policies capable of delivering the future that consumers and small businesses deserve. She is also a member of the New Payments System Operator’s End User Advisory Council, advising on fintech, and a member of Working and Stakeholder Groups at the Open Banking Implementation Entity, where she argues for the sound foundations required to deliver the services consumers and small businesses deserve. Louise is also an investor in and advisor to startups in fintech, ad tech, and millennial podcasting.

Presentations

So, you want to be successful in the open future? Keynote

Louise Beaumont explores the five characteristics of companies that choose to succeed.

Welcome to your open future Findata

Welcome to your open future. Louise Beaumont explains what’s happening and why, who’s got the power, who’s in the race, and how you win.

Jason Bell specializes in high-volume streaming systems for large retail customers, using Kafka in a commercial context for the last five years. Jason was section editor for Java Developer’s Journal, has contributed to IBM developerWorks on autonomic computing, and is the author of Machine Learning: Hands On for Developers and Technical Professionals.

Presentations

Learning how to design automatically updating AI with Apache Kafka and Deeplearning4j Session

Jason Bell offers an overview of a self-learning knowledge system that uses Apache Kafka and Deeplearning4j to accept data, apply training to a neural network, and output predictions. Jason covers the system design, the rationale behind it, and the implications of using streaming data with deep learning and artificial intelligence.

Albert Bifet is a professor and head of the Data, Intelligence, and Graphs (DIG) Group at Télécom ParisTech and a scientific collaborator at École Polytechnique. A big data scientist with 10+ years of international experience in research, Albert has led new open source software projects for business analytics, data mining, and machine learning at Huawei, Yahoo, the University of Waikato, and UPC. At Yahoo Labs, he cofounded Apache SAMOA (Scalable Advanced Massive Online Analysis), a distributed streaming machine learning framework that contains a programming abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning Group, he co-led MOA (Massive Online Analysis), the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and the editor of the Big Data Mining special issue of SIGKDD Explorations. He was cochair of the industrial track at ECML PKDD, BigMine, and the data streams track at ACM SAC. He holds a PhD from BarcelonaTech.

Presentations

StreamDM: Advanced data science with Spark Streaming Session

Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei's Noah’s Ark Lab and Télécom ParisTech.

Enric Biosca is a data and analytics manager at the everis Data Innovation Center (eDIN). Enric is a computer engineer with more than 10 years of experience in data and information architecture. He holds a master’s degree in data science from the Universitat de Barcelona.

Presentations

The eAGLE accelerator: How to speed up migrations from legacy ETL to big data implementations Session

Enric Biosca offers an overview of the eAGLE accelerator, which speeds up migration processes from legacy ETL to big data implementations by enabling auditing, lineage, and translation of legacy code for big data. Along the way, Enric demonstrates how graph and automatic translation technologies help companies reduce their migration times.

Lee Blum is a product manager for big data analytics in the Cyber Intelligence division at Verint, where he is responsible for big data solutions on large-scale cyber systems, providing rapid ingestion, processing, and advanced analytics of data collected by high-end cyber probes. Lee has over 15 years of experience in IP networks, backend development, and petabyte-scale big data analytics.

Presentations

The ultimate data scientist's playground: Building a multipetabyte analytic infrastructure for cyber defense Session

Lee Blum offers an overview of Verint's large-scale cyber-defense system built to serve its data scientists with versatile analytic operations on petabytes of data and trillions of records, covering the company's extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system’s overall results.

Nuria Bombardó is the insights revenue management lead for Europe at Pepsico, where she is responsible for building capabilities to ensure shopper-centric pack-price, promotion, and pricing strategies. Nuria has over 20 years of experience at leading companies in FMCG, the hospitality industry, and corporate finance consultancy and M&A.

She has held positions in revenue management strategy, analytics and insights, category management, sales, and strategic planning. Her passion is behavioral science for robust shopper understanding, turning insights into successful action plans and data-driven decision making.

Nuria led a big data project that enabled successful differentiated execution by store based on shopper behavior. The stores impacted by the project achieved double-digit growth in a declining channel.

Presentations

Delivering business value to Pepsico through mobility and retail data DCS

The entire retail sector has started to transform its core business in order to meet new consumer demands. Carme Artigas and Nuria Bombardó explain how Synergic Partners has helped Pepsico leverage its data to achieve direct business impact in revenue growth.

Behzad Bordbar is a mathematician, software engineer, and big data technical instructor at Cloudera, where he teaches courses on Hadoop, Hive, Impala, and Spark. Behzad has worked in academia for over 12 years and has been a visiting scientist at HP, BT, and IBM.

Presentations

Data science and machine learning with Apache Spark (Day 2) (SOLD OUT) Training Day 2

Behzad Bordbar demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Data science and machine learning with Apache Spark (SOLD OUT) 2-Day Training

Behzad Bordbar demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby and Claudiu Branzan lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Mikio Braun is a principal engineer for search at Zalando, one of Europe’s biggest fashion platforms. He worked in research for a number of years before becoming interested in putting research results to good use in the industry. Mikio holds a PhD in machine learning.

Presentations

Machine learning for time series: What works and what doesn't Session

Time series data has many applications in industry, in particular predicting the future based on historical data. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn't.

Machine learning: Research and industry Keynote

Mikio Braun has worked in both research and industry and draws on this experience to share insights on how these two areas are the same (and how they are different). He then details how deep learning might change the game again.

Pascal Bugnion is a data engineer at ASI Data Science, where he is working to build SherlockML, a collaborative platform for elite data scientists. Pascal is a maintainer of Jupyter widgets, a library for building user interfaces in Jupyter notebooks, and the author of Scala for Data Science. He holds a PhD in theoretical condensed matter physics from Cambridge University.

Presentations

Human-in-the-loop data science with Jupyter widgets Session

Jupyter widgets let you create lightweight, interactive graphical interfaces directly in Jupyter notebooks. Pascal Bugnion demonstrates how to use Jupyter widgets to implement human-in-the-loop machine learning with highly interactive user interfaces.

Tobias Bürger leads the Platform and Architecture Group within the Big Data, Machine Learning, and Artificial Intelligence Department at BMW Group, where he is responsible for the global big data platform that is the core technical pillar of the BMW data lake and is used across different divisions inside the BMW Group, spanning areas such as production, aftersales, and ConnectedDrive.

Presentations

Data-driven ecosystems in the automotive industry Session

The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside the organization. Tobias Bürger and Joshua Görner discuss the end-to-end relationship of data and models and share best practices for scaling applications in real-world environments.

Elisa Celis is a senior research scientist at the School of Computer and Communication Sciences at EPFL. Previously, she was a research scientist at Xerox Research, where she was the worldwide head of the Crowdsourcing and Human Computation research thrust. Her research focuses on studying social and economic questions that arise in the context of the internet and spans multiple areas including fairness in AI/ML, social computing, online learning, network science, and mechanism design. Elisa is the recipient of the Yahoo! Key Challenges Award and the China Theory Week Prize. She holds a BSci in computer science and mathematics from Harvey Mudd College and a PhD in computer science from the University of Washington.

Presentations

Fairness and diversity in online social systems Session

There is a pressing need to design new algorithms that are socially responsible in how they learn and socially optimal in the manner in which they use information. Elisa Celis explores the emergence of bias in algorithmic decision making and presents first steps toward developing a systematic framework to control biases in classical problems, such as data summarization and personalization.

Simon Chan is a senior director of product management for Salesforce Einstein, where he oversees platform development and delivers products that empower everyone to build smarter apps with Salesforce. Simon is a product innovator and serial entrepreneur with more than 14 years of global technology management experience in London, Hong Kong, Guangzhou, Beijing, and the Bay Area. Previously, Simon was the cofounder and CEO of PredictionIO, a leading open source machine learning server (acquired by Salesforce). Simon holds a BSE in computer science from the University of Michigan, Ann Arbor, and a PhD in machine learning from University College London.

Presentations

The journey of machine learning platform adoption in enterprise Session

The promises of AI are great, but taking the steps to implement AI within an enterprise is challenging. The secret behind enterprise AI success often traces back to the underlying platform that accelerates AI development at scale. Based on years of experience helping executives establish AI product strategies, Simon Chan helps you discover the AI platform journey that is right for your business.

Bala Chandrasekaran is a managing director at Barclays, where he heads up data platforms and services for the company’s UK division and leads numerous transformative data initiatives, including setting up the new data warehouse, operational data store, and machine learning capability for the bank. Bala has over 25 years of experience working at the intersection of financial services and technology, with diverse international experience spanning multiple continents in varied roles, from conceptualization and design to development and deployment of complex core banking systems.

Presentations

Moving machine learning and analytics to hyperspeed Keynote

Imagine the value you could drive in your business if you could accelerate your journey to machine learning and analytics. Amr Awadallah, Ankit Tharwani, and Bala Chandrasekaran explain how Barclays has driven innovation in real-time analytics and machine learning with Apache Kudu, accelerating the time to value across multiple business initiatives, including marketing, fraud prevention, and more.

Guillaume Chaslot is the founder of both consulting firm IntuitiveAI and nonprofit AlgoTransparency. Previously, Guillaume worked at Google. He holds a PhD in artificial intelligence from Maastricht University.

Presentations

Finding bias in social media recommendations Session

An increasing number of ex-Google and ex-Facebook employees state that social media is starting to control us rather than the other way around. How can we determine if social media is a pure reflection of people's interests or if it pushes us toward specific narratives? Guillaume Chaslot explores methodologies to find out which narratives are favored by social media recommendation engines.

Étienne Chassé St-Laurent is a data scientist working in enterprise analytics at Aviva Canada, where he is building a new generation of predictive models by leveraging his background in statistics and actuarial science together with machine learning to solve issues specific to the insurance industry. In the past, he’s worked at Statistics Canada and in the pharmaceutical industry but has found a home for the last 10 years in insurance, six of them in R&D. He holds a BSc in actuarial science and an MSc in statistics. Étienne is mainly fueled by sugar and always receptive to book recommendations.

Presentations

Risk-sharing pools: Winning zero-sum games through machine learning Session

Risk-sharing pools allow insurers to get rid of risks they are forced to insure in highly regulated markets. Insurers thus cede both the risk and its premium. But are they ceding the right risk or simply giving up premium? Baiju Devani and Étienne Chassé St-Laurent share an applied machine learning approach that leverages an ensemble of models to gain a distinctive market advantage.

Jean-Luc Chatelain is a managing director for Accenture Digital and the CTO for Accenture Applied Intelligence, where he focuses on helping Accenture customers become information-powered enterprises by architecting state-of-the-art big data solutions. Previously, Jean-Luc was the executive vice president of strategy and technology for DataDirect Networks Inc. (DDN), the world’s largest privately held big data storage company, where he led the company’s R&D efforts and was responsible for corporate and technology strategy; a Hewlett-Packard fellow and vice president and CTO of information optimization responsible for leading HP’s information management and business analytics strategy; founder and CTO of Persist Technologies (acquired by HP), a leader in hyperscale grid storage and archiving solutions whose technology is the basis of the HP Information Archiving Platform IAP; and CTO and senior vice president of strategic corporate development at Zantaz, a leading service provider of information archiving solutions for the financial industry, where he played an instrumental role in the development of the company’s services and raised millions of dollars in capital for international expansion. He has been a board member of DDN since 2007. Jean-Luc studied computer science and electrical engineering in France and business at Emory University’s Goizueta Executive Business School. He is bilingual in French and English and has also studied Russian and classical Greek.

Presentations

Executive Briefing: Becoming a data-driven enterprise—A maturity model Session

A data-driven enterprise maximizes the value of its data. But how do enterprises emerging from technology and organization silos get there? Teresa Tung and Jean-Luc Chatelain explain how to create a data-driven enterprise maturity model that spans technology and business requirements and walk you through use cases that bring the model to life.

Marty Cochrane is the director of solution architecture for EMEA at Arundo. Previously, he led software development at Statkraft, Europe’s biggest renewable energy company. Marty’s background is in software engineering, with a specialization in industrial applications. In his spare time, he competes professionally around the world racing superbikes. Marty has also been developing his own technical platform, used by top teams around Europe to develop riders’ skills.

Presentations

Real-time motorcycle racing optimization DCS

In motorcycle racing, riders make snap decisions that determine outcomes spanning from success to grievous injury. Fausto Morales and Marty Cochrane explain how they use a custom software-based edge agent and machine learning to automate real-time maneuvering decisions in order to limit tire slip during races, thereby mitigating risk and enhancing competitive advantage.

Ira Cohen is a cofounder and chief data scientist at Anodot, where he’s responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

The app trap: Why every mobile app and mobile operator needs anomaly detection Session

The mobile world has so many moving parts that a simple change to one element can cause havoc somewhere else, resulting in issues that annoy users and cause revenue leaks. Ira Cohen outlines ways to use anomaly detection to track everything mobile, from the service and roaming to specific apps, to fully optimize your mobile offerings.

Darren Cook is a director at QQ Trend, a financial data analysis and data products company. Darren has over 20 years of experience as a software developer, data analyst, and technical director and has worked on everything from financial trading systems to NLP, data visualization tools, and PR websites for some of the world’s largest brands. He is skilled in a wide range of computer languages, including R, C++, PHP, JavaScript, and Python. Darren is the author of two books, Data Push Apps with HTML5 SSE and Practical Machine Learning with H2O, both from O’Reilly.

Presentations

Using LSTMs to aid professional translators Session

Darren Cook demonstrates how to use LSTMs, state-of-the-art tokenizers, dictionaries, and other data sources to tackle translation, focusing on one of the most difficult language pairs: Japanese to English.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Findata welcome Tutorial

Hosts Alistair Croll and Robert Passarella welcome you to Findata Day.

Thursday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Anthony Culligan is a partner at Sumetis Limited, CEO of Roolo, and product CEO of SETL, where he leads the strategic decisions that identify which problems SETL’s blockchain technology looks to overcome. Anthony has over 35 years of experience in finance and technology. Previously, he was a director at Robert Fleming Securities, where he managed a quantitative fund and invested in complex hedge funds; senior vice president at JPMorgan Chase; CIO at Aida Capitol; and founder, partner, and CEO of F&C Partners, a fund of hedge fund managers.

Presentations

Session with Anthony Culligan Findata

Session with Anthony Culligan

Nick Curcuru is vice president of enterprise information management at Mastercard, where he’s responsible for leading a team that works with organizations to generate revenue through smart data, architect next-generation technology platforms, and protect data assets from cyberattacks by leveraging Mastercard’s information technology and information security resources and creating peer-to-peer collaboration with their clients. Nick brings over 20 years of global experience successfully delivering large-scale advanced analytics initiatives for such companies as the Walt Disney Company, Capital One, Home Depot, Burlington Northern Railroad, Merrill Lynch, Nordea Bank, and GE. He frequently speaks on big data trends and data security strategy at conferences and symposiums, has published several articles on security, revenue management, and data security, and has contributed to several books on the topic of data and analytics.

Presentations

Security, governance, and cloud analytics, oh my! Session

Having so many cloud-based analytics services available is a dream come true. However, it's a nightmare to manage proper security and governance across all those different services. Nikki Rouda and Nick Curcuru share advice on how to minimize the risk and effort in protecting and managing data for multidisciplinary analytics and explain how to avoid the hassle and extra cost of siloed approaches.

Paul Curtis is a principal solutions architect at Weaveworks. Previously, he was a principal engineer at MapR.

Presentations

Making stateless containers reliable and available even with stateful applications Session

The flexibility advantage conferred by containers depends on their ephemeral nature, so it’s useful to keep containers stateless. However, many applications require state—access to a scalable persistence layer that supports real mutable files, tables, and streams. Paul Curtis demonstrates how to make containerized applications reliable, available, and performant, even with stateful applications.

Bryan Cutler is a software engineer at IBM’s Spark Technology Center, where he works on big data analytics. He is a contributor to Apache Spark in the areas of ML, SQL, Core, and Python and a committer for the Apache Arrow project. Bryan is interested in pushing the boundaries to build high-performance tools for analytics and machine learning.

Presentations

Model parallelism in Spark ML cross-validation Session

Tuning a Spark ML model using cross-validation involves a computationally expensive search over a large parameter space. Nick Pentreath and Bryan Cutler explain how enabling Spark to evaluate models in parallel can significantly reduce the time to complete this process for large workloads and share best practices for choosing the right configuration to achieve optimal resource usage.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Giuseppe D’alessio is a data engineer and chapter lead in the Analytics Department at ING, where he has worked on several international projects in data analytics and security. Currently, he leads the fast data chapter, building streaming applications that make customer communication extremely personalized and relevant. Giuseppe holds a master’s degree in computer engineering with a focus on artificial intelligence, pattern recognition, and software development.

Presentations

DevOps at ING Analytics: Combining data engineering with data operations Session

Giuseppe D'alessio details ING's DevOps journey, covering its impact on people, processes, and tools, along with best practices and pitfalls. Giuseppe concludes with a concrete example of using analytics and streaming technology in real-time applications.

Presentations

GPU-accelerated threat detection with GOAI Session

Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration.

Danielle Dean is the technical director of machine learning at iRobot. Previously, she was a principal data science lead at Microsoft. She holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill.

Presentations

Executive Briefing: Lessons learned managing data science projects—Adopting a team data science process Session

Danielle Dean covers the basics of managing data science projects, including the data science lifecycle, and offers an overview of an internal approach at Microsoft called the Team Data Science Process (TDSP). Join in to learn more about the typical priorities of data science teams and the keys to success on engaging and creating value with data science.

Baiju Devani is vice president of enterprise analytics at Aviva Canada, where he leads a team of data scientists in the application of analytics across all aspects of the insurance business, from core insurance activities, such as pricing and risk selection, to the development of cutting-edge robotic processes and the application of machine learning algorithms to areas such as claims processing and optimizing client acquisition. Previously, Baiju led the Analytics Group at IIROC, an entity overseeing Canadian equity and fixed-income markets, where he guided decision making in today’s machine-driven (algorithmic) markets, embedded machine learning and other algorithmic surveillance capabilities for market oversight, and spearheaded IIROC’s primary research into market-structure issues and emerging technologies, such as blockchains. He was also part of the early team at fintech startup OANDA, which disrupted retail foreign-exchange markets, where he built and led the data engineering, data science, and growth teams, and he founded Fstream, a SaaS provider for ingesting and analyzing high-frequency streaming data. Baiju holds a BSc and an MSc in computer science from Queen’s University. He developed his data chops working on large biological datasets as part of his graduate work and later at the Ontario Cancer Institute.

Presentations

Risk-sharing pools: Winning zero-sum games through machine learning Session

Risk-sharing pools allow insurers to get rid of risks they are forced to insure in highly regulated markets. Insurers thus cede both the risk and its premium. But are they ceding the right risk or simply giving up premium? Baiju Devani and Étienne Chassé St-Laurent share an applied machine learning approach that leverages an ensemble of models to gain a distinctive market advantage.

Philipp Diesinger is global head of data science at Boehringer Ingelheim, where he develops the company’s data science capabilities following a quant-level approach driven by the philosophy that the impact of data science is maximized through bright minds and scaled through technology. The artificial intelligence solutions he and his team have been working on have enabled Boehringer Ingelheim to adopt industry-leading positions in key areas. Philipp is a passionate data scientist who firmly believes that data-driven scientific problem solving will significantly transform industries and economies on a large scale. He started his career as a researcher at the Massachusetts Institute of Technology following his PhD in theoretical physics and has worked for a number of companies, including SAP Global Data Science Consulting, where he developed smart data science solutions for globally operating customers and scaled the data science team’s life science engagements.

Presentations

Data, AI, and innovation in the enterprise Session

What are the latest initiatives and use cases around data and AI? How are data and AI reshaping industries? How do we foster a culture of data and innovation within a larger enterprise? What are some of the challenges of implementing AI within the enterprise setting? Michael Li moderates a panel of experts in different industries to answer these questions and more.

Mark Donsky leads product management at Okera, a software provider offering data discovery, access control, and governance at scale for today’s heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera and held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from Western University in Ontario, Canada.

Presentations

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Syed Rafice outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Tutorial

Hybrid big data deployments present significant new security risks. Security admins must ensure a consistently secured and governed experience for end users and administrators across multiple workloads. Mark Donsky, Steffen Maerkl, and André Araujo share best practices for meeting these challenges as they walk you through securing a Hadoop cluster.

Jim Dowling is the CEO of Logical Clocks, an associate professor at KTH Royal Institute of Technology in Stockholm, and lead architect of Hopsworks, an open source data and AI platform. He’s a regular speaker at big data industry conferences. He holds a PhD in distributed systems from Trinity College Dublin.

Presentations

Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops Session

Distributed deep learning can increase the productivity of AI practitioners and reduce time to market for training models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed DL applications, covering TensorFlow and Apache Spark frameworks that make distribution easy.

Ted Dunning is the chief technology officer at MapR, an HPE company. He’s also a board member for the Apache Software Foundation, a PMC member, and committer on a number of projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Rendezvous with AI Session

Ted Dunning offers an overview of the rendezvous architecture, which is geared to deal with much of the complexity involved in deploying models to production, thus allowing more time to be spent thinking and doing real data science. Ted covers the ideas behind the architecture, practical scenarios, and advantages and disadvantages of the architecture.

Radhika Dutt is the coauthor of the Medium publication Radical Product and a cocreator of the Radical Product practical toolkit. Radhika is a product executive who has participated in four exits, two of which were companies she founded, including Lobby7, a venture-backed company that created an early version of Siri back in 2000 (acquired by Scansoft/Nuance). Most recently, she led product management at Allant, where she and her team built a SaaS product for TV advertising. (Allant’s TV division was subsequently acquired by Acxiom.) Previously, she worked at Avid, growing its broadcast business by building a product suite to address pain points of broadcasters worldwide as they were moving from tape to digital media; led strategy at the telecom startup Starent Networks (acquired by Cisco for $2.9B); and founded Likelii, a company that offered consumers a “Pandora for wine” (acquired by Drync). Too long ago to admit, Radhika graduated from MIT with an SB and MEng in electrical engineering. She speaks nine languages.

Presentations

Measure what matters: How your measurement strategy can reduce opex Tutorial

These days it’s easy for companies to say, "We measure everything!” The problem is, most popular metrics may not be appropriate or relevant for your business. Measurement isn’t free and should be done strategically. Radhika Dutt, Geordie Kaytes, and Nidhi Aggarwal explain how to align measurement with your product strategy so you can measure what matters for your business.

Erik Elgersma is director of strategic analysis at FrieslandCampina, one of the world’s largest dairy companies, where he has led the company’s strategic analysis practice for 18 years. Previously, Erik held positions in corporate strategy, business development, and innovation management at FrieslandCampina. He speaks and lectures frequently on the topics of strategic analysis, data management, data science, and competitive intelligence to clients including the Royal Dutch Association of Information Professionals, the US Navy’s Center for Naval Analysis, Cimi.Con, the Institute for Competitive Intelligence, IAFIE-Europe, and Brunel University, London.

Presentations

Predictive analytics for FMCG business strategies and tactics DCS

Erik Elgersma details how a global food company uses multiple sources of monthly actualized data and smart algorithms to predict future commodity prices as input for its commercial policies.

Wael Elrifai is vice president of solution engineering for big data, the IoT, and AI at Hitachi Vantara. An engineer, author, and speaker in the AI and IoT spaces, Wael has worked with corporate and government clients in North America, Europe, the Middle East, and East Asia across a number of industry verticals. He holds graduate degrees in electrical engineering and economics and is a member of the Association for Computing Machinery, the Special Interest Group for Artificial Intelligence, the Royal Economic Society, and the Royal Institute of International Affairs.

Presentations

The IoT and AI for good (sponsored by Hitachi Vantara) Session

Wael Elrifai shares his experiences working in the IoT and AI spaces, covering complexities, pitfalls, and opportunities to explain why innovation isn’t just good for business—it's a societal imperative.

Dan Enthoven works in partnerships at Domino Data Lab, where he helps customers get the most out of their data science programs. Previously, Dan worked at a range of data science-driven companies, including Nuance Communications and Monster Worldwide. Over his career, he has focused on natural language processing, recruiting, and employee performance analytics. Dan holds a BA and an MBA from Stanford University.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Dan Enthoven outlines a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Christina Erlwein-Sayer is a senior quantitative analyst and researcher at OptiRisk Systems, where she works on financial analytics, in particular models and tools for portfolio construction and fixed income risk assessment. Her main research interest lies in regime-switching models for finance. Previously, she was a researcher and consultant in the Financial Mathematics Department at Fraunhofer ITWM in Kaiserslautern, Germany. She currently leads the Eurostars project SenRisk on the development of sentiment-enhanced risk assessment tools. She is also a member of Quantess London, the first community in London designed for female quants with a passion for data, science, and finance; Quantess mentors and promotes women across seniority levels and industries who are passionate about quantitative fields, bringing them together to discuss practical research ideas. Christina holds a PhD in mathematics from Brunel University, London.

Presentations

Macroeconomic news sentiment: Enhanced risk assessment for sovereign bond spreads Findata

Christina Erlwein-Sayer explains how to enhance the modeling and forecasting of sovereign bond spreads by considering quantitative information gained from macroeconomic news sentiment, using a number of large news analytics datasets.

Olga Ermolin is a senior business intelligence engineer at MLS Listings, where she is responsible for standardizing the schema of the company’s real estate database across multiple real estate hosting companies as well as maintaining day-to-day data integrity and scalability. She also created the company’s BI product, which enables clients to visualize and analyze real estate trends and performance.

Presentations

Using Siamese CNNs for removing duplicate entries from real estate listing databases Session

Aggregation of geospecific real estate databases results in duplicate entries for properties located near geographical boundaries. Sergey Ermolin and Olga Ermolin detail an approach for identifying duplicate entries via the analysis of images that accompany real estate listings that leverages a transfer learning Siamese architecture based on VGG-16 CNN topology.

Sergey Ermolin is a software solutions architect for deep learning, Spark analytics, and big data technologies at Intel. A Silicon Valley veteran with a passion for machine learning and artificial intelligence, Sergey has been interested in neural networks since 1996, when he used them to predict aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard. Sergey holds an MSEE and a certificate in mining massive datasets from Stanford and BS degrees in both physics and mechanical engineering from California State University, Sacramento.

Presentations

Using Siamese CNNs for removing duplicate entries from real estate listing databases Session

Aggregation of geospecific real estate databases results in duplicate entries for properties located near geographical boundaries. Sergey Ermolin and Olga Ermolin detail an approach for identifying duplicate entries via the analysis of images that accompany real estate listings that leverages a transfer learning Siamese architecture based on VGG-16 CNN topology.

Moty Fania is a principal engineer and the CTO of the Advanced Analytics Group at Intel, which delivers AI and big data solutions across Intel. Moty has rich experience in ML engineering, analytics, data warehousing, and decision-support solutions. He led the architecture work and development of various AI and big data initiatives such as IoT systems, predictive engines, online inference systems, and more.

Presentations

A high-performance system for deep learning inference and visual inspection Session

Moty Fania explains how Intel implemented an AI inference platform to enable internal visual inspection use cases and shares lessons learned along the way. The platform is based on open source technologies and was designed for real-time streaming and online actuation.

Danyel Fisher is a principal design researcher at Honeycomb.io. Danyel’s work focuses on ways to help users interact with data more easily through data visualization and analytics, particularly by visualizing big data, logfile, and trace data. Previously, he spent 13 years at Microsoft Research, where among other things, he looked at ways to make big data analytics faster and more interactive with incremental visualization. Danyel is the coauthor, with Miriah Meyer, of Making Data Visual. He holds an MS from UC Berkeley and a PhD from UC Irvine.

Presentations

Making data visual: A practical session on using visualization for insight Tutorial

Danyel Fisher and Miriah Meyer explore the human side of data analysis and visualization, covering operationalization, the process of reducing vague problems to specific tasks, and how to choose a visual representation that addresses those tasks. Along the way, they also discuss single views and explain how to link them into multiple views.

Dave Fitch is head of operations at The Data Lab, where he is responsible for managing project development, contracting, and HR teams, as well as all aspects of The Data Lab’s sponsored project delivery. Dave also leads a range of strategic projects, including the Cancer Innovation Challenge.

Presentations

How can data help treat cancer? Lessons from Scotland's Cancer Innovation Challenge DCS

Scotland has some of the world's best cancer data and some of the world's best data scientists and data companies. But what Scotland doesn't have is very good cancer outcomes. So how can we use data, our skills, and our networks to deliver better cancer treatments and results? Dave Fitch shares lessons learned from two years of delivering Scotland's Cancer Innovation Challenge.

Jeff Fletcher is a systems engineer at Cloudera, where he helps customers build big data infrastructure. Jeff has been involved in internet technology all his professional life. Previously, he worked on the initial internet infrastructure team and managed aspects of the Johannesburg Beltel installation at Telkom; designed and implemented new internet products and services at Sprint (which became UUNET which became Verizon Business); founded Antfarm Networking Technologies, South Africa’s first streaming and webcasting company; and led the product development team at Internet Solutions (then IS). He does occasional consulting for corporate companies looking to move beyond pie charts. Jeff was shortlisted for an Information Is Beautiful award in 2015. He is the creator of Limn.co.za, a blog dedicated to the art of data visualization. Jeff holds a degree in electrical engineering from Witwatersrand University.

Presentations

Data visualization in a big data world Session

As big data adoption grows, Apache Hadoop, Apache Spark, and machine learning technologies are increasingly being used to analyze ever-larger datasets, but we still have to keep telling stories about the data and making sure the message is clear. Jeff Fletcher details the tools and techniques that are relevant to data visualization practitioners working with large datasets and predictive models.

Christine Foster is managing director for innovation at the Alan Turing Institute, where she is responsible for translating the institute’s data science and artificial intelligence research into real-world impact. She forges connections between the Turing’s science activities and industry, public sector, and third-sector needs, broadening the institute’s engagement with partners and extending its reach into industry. Previously, Christine advised Virgin Media on implementing machine learning models to personalize customer interactions and Liberty Global on building a world-class data science team; held leadership positions at a fintech startup, American Express, and EMI Music; and built digital analytics teams, implemented predictive models, and generally worked at the intersection of data science and business. Originally from Canada, Christine started her business career as a strategy consultant with Bain & Company. She holds an MBA from INSEAD and a BA in economics from the University of Toronto.

Presentations

Out of the lab and into real life Keynote

There is a common conception that artificial intelligence will change business. But as researchers at the Alan Turing Institute (the national center for data science and AI) well know, a new algorithm alone does not change the world. Christine Foster explores how businesses and researchers can find common ground and how today’s academic papers turn into tomorrow’s data science.

Eugene Fratkin is a director of engineering at Cloudera, heading cloud R&D. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

Running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Jason Wang, Eugene Fratkin, and Mael Ropars detail new paradigms to effectively run production-level pipelines with minimal operational overhead. Join in to learn how to remove barriers to data discovery, metadata sharing, and access control.

Gisele Frederick is the founder of Zingr.io, a fintech startup enabling digital payments via mobile phones in developing countries. Built on blockchain technology, Zingr empowers unbanked and underbanked populations, as well as small businesses, through a platform that makes accepting payments online and peer to peer seamless, fast, and secure.

An interdisciplinary technologist, Gisele started her career as a software developer in investment banking, where she gained a wealth of experience working on and managing high-profile technology projects on both the technical and business side.

Presentations

Fireside chat with Gisele Frederick Findata

Gisele Frederick in conversation with Alistair Croll.

Yupeng Fu is a founding member and senior architect at Alluxio and a PMC member of the Alluxio open source project. Previously, Yupeng worked at Google building big data analytics platforms and Palantir, where he led the efforts building the company’s storage solution. Yupeng holds a BS and an MS from Tsinghua University and has completed coursework toward a PhD at UCSD.

Presentations

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Session

Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average.

Barbara Fusinska is a machine learning strategic cloud engineering manager at Google with a strong software development background. Previously, she worked at a variety of companies, including ABB, Base, Trainline, and Microsoft, where she gained experience building diverse software systems, ultimately focusing on data science and machine learning. Barbara believes in the importance of data and metrics when growing a successful business. In her free time, Barbara enjoys programming and collaborating around data architecture. She can be found on Twitter as @BasiaFusinska and blogs at http://barbarafusinska.com.

Presentations

Introduction to natural language processing with Python Tutorial

Natural language processing techniques help address tasks like text classification, information extraction, and content generation. Barbara Fusinska offers an overview of natural language processing and walks you through building a bag-of-words representation, using Python and its machine learning libraries, and then using it for text classification.
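To give a flavor of the tutorial’s starting point: a bag-of-words representation simply maps each document to its token counts. Here is a minimal, pure-Python sketch (illustrative only; the tutorial itself uses Python’s machine learning libraries and real datasets):

```python
# A minimal bag-of-words sketch: lowercase, tokenize on whitespace,
# and count how often each token occurs in a document.
from collections import Counter

def bag_of_words(text):
    """Return a token-frequency Counter for one document."""
    return Counter(text.lower().split())

doc = "The cat sat on the mat"
bow = bag_of_words(doc)
print(bow["the"])  # "The" and "the" fold together, so this prints 2
```

In practice, a library vectorizer (such as scikit-learn’s CountVectorizer) assembles these counts into a document-term matrix that a text classifier can consume.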

Ketan Gangatirkar is the vice president of engineering for job seeker products at Indeed.

Presentations

The artful science of metrics: Measurements that work Session

Quantitative measurement is the key to scaling businesses, processes, and products and making them better. It sounds easy: just pick a number and improve it. However, actually choosing a metric is an exploration of a many-dimensional space with no map and no guide. Until now. Join Ketan Gangatirkar to learn how to choose the right metrics so you can build a better product and a better business.

Jonathan Genah is IT director of information management at DHL Supply Chain, where he drives the data and analytics and customer visibility agenda. Jonathan has served in a number of roles at DHL Supply Chain, driving innovation and standardization. He holds an MSc in mechanical engineering and robotics from the Politecnico di Milano.

Presentations

How DHL is increasing efficiency and reducing distance traveled across the warehouse with the IoT DCS

DHL has partnered with Conduce on a human interface that delivers real-time visualizations tracking and analyzing the distance traveled by personnel and warehouse equipment, all calibrated around a center of activity. Michael Troughton explains how this immersive data visualization gives DHL unprecedented insight to evaluate and act on everything that occurs in its warehouses.

Konstantinos Georgatzis is a data scientist in client projects at QuantumBlack, where he leads the development of methodologies to optimize commercial performance for pharma and healthcare clients, focusing on clinical trial and real-world patient outcomes. He is also developing ways to further improve algorithms that are frequently used within QuantumBlack. Konstantinos has extensive experience developing machine learning methods, especially for healthcare analytics using biomedical and clinical data. He has published eight articles in international machine learning and medical venues, four of them as first author, and has presented at multiple conferences. He also serves as a reviewer for international machine learning conferences. He holds a PhD from the University of Edinburgh, where his research focused on developing novel machine learning methods and applying them to better model the biosignals of intensive care unit patients.

Presentations

Interpretable AI: Can we trust machine learning? Session

Konstantinos Georgatzis and Martha Imprialou explain how to interpret the predictions given by your black-box model and how machine learning is helping to drive decision making today.

Aurélien Géron is a machine learning consultant at Kiwisoft and author of the best-selling O’Reilly book Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. Previously, he led YouTube’s video classification team, was a founder and CTO of Wifirst, and was a consultant in a variety of domains: finance (JPMorgan and Société Générale), defense (Canada’s DOD), and healthcare (blood transfusion). He also published a few technical books (on C++, WiFi, and internet architectures), and he’s a lecturer at the Dauphine University in Paris. He lives in Singapore with his wife and three children.

Presentations

Deep computer vision for manufacturing Session

Convolutional neural networks (CNNs) can now complete many computer vision tasks with superhuman ability. This will have a large impact on manufacturing by improving anomaly detection, product classification, analytics, and more. Aurélien Géron details common CNN architectures, explains how they can be applied to manufacturing, and covers potential challenges along the way.

Naveed Ghaffar is the cofounder of Narrative Economics, a startup that is leading research and development in the emerging field of natural language understanding as it pertains to the spread of popular narratives across the world. A serial entrepreneur and innovator in data management and data science technologies, Naveed has founded three startups in this domain over the past six years and continues to coach and mentor a number of London-based startups. Most recently, Naveed was chief engineer for KPMG McLaren, where he was responsible for overall product management for the company’s suite of data analytics and simulation solutions. Naveed’s background is in data analytics and data governance. He is recognized as a thought leader on privacy by design, design thinking, and the policy and technical effects of the EU GDPR regulations. Naveed holds a degree in law (LLB honors) and an MSc from the University of Birmingham. He is a Certified Scrum Master, design thinking coach, and jobs-to-be-done innovator.

Presentations

Narrative extraction: Analyzing the world’s narratives through natural language understanding Session

Narratives are significant vectors of rapid change in culture, economic behavior, and the Zeitgeist of a society. Narrative economics studies the impact of popular human-interest stories on economic fluctuations. Naveed Ghaffar and Rashed Iqbal outline a framework that uses natural language understanding to extract and analyze narratives in human communication.

Daniel Gilbert is director of data at News UK. Previously, Daniel was a data science consultant working with publishers in the UK and US.

Presentations

Revolutionizing the newsroom with artificial intelligence Session

In the era of 24-hour news and online newspapers, editors in the newsroom must quickly and efficiently make sense of the enormous amounts of data that they encounter and make decisions about their content. Daniel Gilbert and Jonathan Leslie discuss an ongoing partnership between News UK and Pivigo in which a team of data science trainees helped develop an AI platform to assist in this task.

Zachary Glassman is a data scientist in residence at the Data Incubator. Zachary has a passion for building data tools and teaching others to use Python. He studied physics and mathematics as an undergraduate at Pomona College and holds a master’s degree in atomic physics from the University of Maryland.

Presentations

Hands-on data science with Python 2-Day Training

Zachary Glassman offers a foundation in building intelligent business applications using machine learning, walking you through all the steps of developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend these models into two applications using real-world datasets.

Hands-on data science with Python (Day 2) Training Day 2

Zachary Glassman offers a foundation in building intelligent business applications using machine learning, walking you through all the steps of developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend these models into two applications using real-world datasets.

Sean Glover is a software engineer specializing in Apache Kafka and its ecosystem on the Fast Data Platform team at Lightbend, which is building a next-generation big data platform distribution with a focus on stream processors, machine learning, and operations ease of use. Sean has several years’ experience helping Global 5,000 companies build data streaming platforms using technologies such as Kafka, Spark, and Akka.

Presentations

Kafka in jail: Running Kafka in container-orchestrated clusters Session

Kafka is best suited to run close to the metal on dedicated machines in static clusters, but these clusters are quickly becoming extinct. Companies want mixed-use clusters that take advantage of every resource available. Sean Glover offers an overview of leading Kafka implementations on DC/OS and Kubernetes to explore how reliably they run Kafka in container-orchestrated clusters.

Joshua Goerner is a data scientist at BMW, where he specializes in working with sensor data extracted from connected vehicles to glean insights about customer behavior and driving patterns. His major research interests cover the reproducibility of data science projects and the fusion of data science and modern software engineering. Previously, he spent years in the pharmaceutical industry.

Presentations

Data-driven ecosystems in the automotive industry Session

The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside the organization. Tobias Bürger and Joshua Görner discuss the end-to-end relationship of data and models and share best practices for scaling applications in real-world environments.

Heitor Murilo Gomes is a researcher at Télécom ParisTech focusing on machine learning—particularly, evolving data streams, concept drift, ensemble methods, and big data streams. He coleads the streamDM open data stream mining project.

Presentations

StreamDM: Advanced data science with Spark Streaming Session

Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei's Noah’s Ark Lab and Télécom ParisTech.

Miguel González-Fierro is a senior data scientist at Microsoft UK, where he helps customers improve their processes using big data and machine learning. Previously, he was CEO and founder of Samsamia Technologies, a company that created a visual search engine for fashion items, allowing users to find products using images instead of words, and founder of the Robotics Society of Universidad Carlos III, which developed projects related to UAVs, mobile robots, small humanoid competitions, and 3D printers. Miguel also worked as a robotics scientist at Universidad Carlos III of Madrid and King’s College London, where his research focused on learning from demonstration, reinforcement learning, computer vision, and dynamic control of humanoid robots. He holds a BSc and MSc in electrical engineering and an MSc and PhD in robotics.

Presentations

Distributed training of deep learning models Session

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

Irene Gonzálvez is a product manager at Spotify. Passionate about innovation and the transformation of business values and customers’ needs into new technical solutions, Irene combines highly technical expertise with accurate planning and leadership capabilities.

Presentations

Big data, big quality: Data quality at Spotify Session

Irene Gonzálvez shares Spotify's process for ensuring data quality, covering why and how the company became aware of its importance, the products it has developed, and future strategy.

Martin Goodson is the chief scientist and CEO of Evolution AI, where he specializes in large-scale natural language processing. Martin has designed data science products that are in use at companies like Dun & Bradstreet, Time Inc., John Lewis, and Condé Nast. Previously, Martin was a statistician at the University of Oxford, where he conducted research on statistical matching problems for DNA sequences. He runs the largest community of machine learning practitioners in Europe, Machine Learning London, and convenes the CBI/Royal Statistical Society roundtable, AI in Financial Services. Martin’s work has been covered by publications such as the Economist, Quartz, Business Insider, TechCrunch, and others.

Presentations

On the limits of decision making with artificial intelligence Session

How can AI become part of our business processes? Should we entrust critical decisions to completely autonomous systems? Drawing on projects from businesses and UK government agencies, Martin Goodson explains how to increase confidence in AI systems and manage the transition to an AI-driven organization.

Charaka Goonatilake is CTO at Panaseer, where he designs and delivers big data solutions that enable chief information security officers and their teams to gain visibility into the true state of security within their business to improve cyber hygiene and reduce cyber risk exposure. Charaka has been immersed in big data technologies since the very early days of Hadoop and has hands-on experience using Hadoop in the enterprise to produce data-driven insights. Over the past eight years, across Panaseer and BAE Systems Applied Intelligence, Charaka has architected and engineered Hadoop-based data platforms for a range of cybersecurity use cases, from security analytics for threat detection to threat intelligence management and cybersecurity risk management.

Presentations

Architecting data platforms for cybersecurity Session

Data is becoming a crucial weapon to secure an organization against cyber threats. Charaka Goonatilake shares strategies for designing effective data platforms for cybersecurity using big data technologies, such as Spark and Hadoop, and explains how these platforms are being used in real-world examples of data-driven security.

Richard Goyder is the head of data science at IMC Business Architecture, where he helps companies change the way they think about and use their big data through AI. Richard came to big data as a user, running data-intensive departments, credit portfolio management, and performance management and compensation for Canada’s largest bank. Previously, he worked in consulting with BCG and in financial services in the UK and Canada. Richard holds degrees from the University of Oxford and INSEAD.

Presentations

Blind men and elephants: What’s missing from your big data? Session

Big data analytics tends to focus on what is easily available, which is by and large data about what has already happened, the implicit assumption being that past behavior will predict future behavior. Organizations already possess data they aren’t exploiting. Barry Singleton and Richard Goyder explain how, with the right tools, it can be used to develop far more powerful predictive algorithms.

Presentations

Elastic map matching using Cloudera Altus and Apache Spark Session

Map-matching applications exist in almost every telematics use case and are therefore crucial to all car manufacturers. Timo Graen and Robert Neumann detail the architecture behind Volkswagen Commercial Vehicle’s Altus-based map-matching application and lead a live demo featuring a map matching job in Altus.

Matthias Graunitz is a big data architect at Audi’s Competence Center for Big Data and Business Intelligence, where he is responsible for the architectural framework of the Hadoop ecosystem, a separate Kafka cluster, and the data science tool kits provided by the Center of Competence for all business departments at Audi. Matthias has more than 10 years’ experience in the field of business intelligence and big data.

Presentations

Audi's journey to an enterprise big data platform Session

Carsten Herbe and Matthias Graunitz detail Audi's journey from a Hadoop proof of concept to a multitenant enterprise platform, sharing lessons learned, the decisions Audi made, and how a number of use cases are implemented using the platform.

Tom Grey leads the solutions architecture team for EMEA at Google, which is charged with advancing the technical state of the art for cloud technologies and working alongside Google’s biggest customers to apply cloud technology in new ways to solve complex customer problems. Previously, Tom was an application architect and lead developer at IBM driving new technology into some of the company’s most complex outsourcing accounts. An unashamed geek, Tom still loves to tinker with Python whenever family life allows.

Presentations

Cloud and the golden age of data analytics (sponsored by Google Cloud) Keynote

The history of data analytics has been marked by an environment of scarcity. The way we approach data analytics is only just catching up. Tom Grey explains why we are on the cusp of a golden age of analytics and machine learning.

Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Big data at speed Session

Many details go into building a big data system for speed, from determining an acceptable latency for data access and where to store the data to solving multiregion problems, or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed.

Democratizing data within your organization Session

Sure, you’ve got the best and fastest SQL engine, but you’ve still got some problems: users don’t know which tables exist or what they contain, and when bad things happen to your data and you need to regenerate partitions, there is no tool to do so. Mark Grover and Deepak Tiwari explain how to make your team and your larger organization more productive when it comes to consuming data.

Haikal Pribadi is the founder and CEO of GRAKN.AI, the database for AI, which uses machine reasoning to handle and interpret complex data. Haikal and his team work on building Grakn, a knowledge graph data platform, and Graql, a knowledge-oriented graph query language that performs machine reasoning to simplify complex data processing for AI applications. GRAKN.AI was recently awarded product of the year for 2017 by the University of Cambridge Computer Lab. Haikal’s interest in the field began at the Monash Intelligent Systems Lab, where he built an open source driver for the Parallax Eddie Robot, which was then adopted by NASA. Haikal was also the youngest algorithm expert behind Quintiq’s optimization technology, which supports some of the world’s largest supply chain systems in transportation, retail, and logistics. He holds a master’s degree in AI from the University of Cambridge.

Presentations

Why knowledge graphs are important to finance Session

Haikal Pribadi explains why knowledge graphs (KGs) are important for AI systems in the finance sector and details how they are being used to detect and uncover new knowledge, specifically for risk analysis, fraud detection, and GDPR use cases.

Tom Harrison is digital transformation manager for housing services, neighborhoods, and the housing directorate at Hackney Council.

Presentations

Predicting rent arrears: Leveraging data science in the public sector Session

One major challenge to social housing is determining how best to target interventions when tenants fall behind on rent payments. Jonathan Leslie, Maryam Qurashi, and Tom Harrison discuss a recent project in which a team of data scientist trainees helped Hackney Council devise a more efficient, targeted strategy to detect and prioritize such situations.

Phil Harvey is a senior CSA for data and AI at Microsoft. Passionate about data and people, he believes empathy is the key data skill. He’s also a big, beardy geek.

Presentations

Successful data cultures: Inclusivity, empathy, retention, and results Session

Our lives are being transformed by data, changing our understanding of work, play, and health. Every organization can take advantage of this resource, but something is holding us back: us. Kim Nilsson and Phil Harvey explain how to build a successful data culture that embeds data at the heart of every organization through people and delivers success through empathy, communication, and humanity.

Kaylea Haynes is a data scientist at Manchester-based data analytics service Peak, which helps companies grow revenue and profits using data and machine learning. Kaylea focuses on developing techniques for demand forecasting. She is a member of the Royal Statistical Society and co-organizes R Ladies Manchester. Kaylea holds a PhD in statistics and operational research from Lancaster University. Her thesis was titled “Detecting Abrupt Changes in Big Data.”

Presentations

The ins and outs of forecasting in a hire business Session

Deciding how much stock to hold is a challenge for hire businesses: there is a fine balance between holding enough stock to fulfill hires and holding so much that overall utilization is too low to achieve a return on investment. Kaylea Haynes shares a case study on forecasting the demand for thousands of assets across multiple locations.

Alvin Heib is a senior partner for enablement at Cloudera. An open source enthusiast, Alvin has a deep knowledge of OpenStack, SDN Open Contrail, and Cloudera’s big data ecosystem. He was an early adopter of Hadoop as a service and is a true believer that not only will the public cloud slowly take over the private cloud but also big data will natively sit in the cloud next to high-data-generating applications. Previously, Alvin was a transformation program portfolio manager.

Presentations

ClickFox: Customer journey analytics powered by OpenStack and Cloudera Session

Alvin Heib and Guy Leroux offer an overview of ClickFox, a platform able to cope with high-performance analytical needs, from bits and bytes to solving customer needs, covering the platform’s virtualization, big data, and analytical layers.

With more than 20 years in the IT industry, Olaf Hein has gained experience as an architect, developer, administrator, trainer, and project manager across many different areas. Storing and processing huge amounts of data has always been a focal point of his work. At ORDIX AG, he is responsible for big data and data warehouse technologies. He has built a team of big data consultants, created several training courses, speaks at conferences, and regularly publishes technical articles.

Presentations

Fast analytics on fast data: Kudu as a storage layer for banking applications Findata

Olaf Hein explains how a large German bank relies on a Kudu-based data platform to speed up business processes. Olaf highlights key data access patterns and the system architecture and shares best practices and lessons learned using Kudu in development and operations.

Christine Henry is a product manager at IQVIA on a healthcare EMR data software platform and a volunteer at DataKind UK, where she consults with charities and nonprofits to find stories and insights in data. Christine is interested in the ethical and social impacts of data and technology and led the development of DataKind UK’s ethical principles for volunteers. Previously, Christine served in data-related roles in healthcare market forecasting, analysis, and market access. She holds a PhD in physical chemistry and a law degree.

Presentations

Executive Briefings: Killer robots and how not to do data science Session

Not a day goes by without reading headlines about the fear of AI or how technology seems to be dividing us more than bringing us together. DataKind UK is passionate about using machine learning and artificial intelligence for social good. Kate Vang and Christine Henry explain what socially conscious AI looks like and what DataKind is doing to make it a reality.

Jason Heo is a senior software engineer at Naver, where he develops analytics systems and graph databases for internal use. Previously, he worked at a number of startups. Jason helped MySQL become widely used in Korea and wrote a book on MySQL. Nowadays, he mainly uses Spark, Elasticsearch, Kudu, and Druid to build analytic systems.

Presentations

Web analytics at scale with Druid at Naver Session

Naver.com is the largest search engine in Korea, with a 70% share of the Korean search market, and it handles billions of pages and events every day. Jason Heo and Dooyong Kim offer an overview of Naver's web analytics system, built with Druid.

Carsten Herbe is a big data architect at Audi Business Innovation GmbH, a subsidiary of Audi focused on developing new mobility services and innovative IT solutions, where he is helping build a big data platform based on Hadoop and Kafka; as a solution architect, he is responsible for developing and running the first analytical applications on that platform. Carsten has more than 10 years’ experience delivering data warehouse and BI solutions as well as big data infrastructure and solutions.

Presentations

Audi's journey to an enterprise big data platform Session

Carsten Herbe and Matthias Graunitz detail Audi's journey from a Hadoop proof of concept to a multitenant enterprise platform, sharing lessons learned, the decisions Audi made, and how a number of use cases are implemented using the platform.

Louise Herring is a partner at McKinsey & Company’s London office and a leader of the company’s Analytics and Digital practices, where she supports consumer-facing companies in the UK and Europe as they transform to tackle both the challenges and opportunities created through digital and data. Recognizing that true transformation takes more than landing a distinctive use case, Louise focuses on building clients’ analytical skills. She is European lead for McKinsey’s Analytics Academy, which supports organizations with their capability building so that they can thrive by accelerating their talent advantage.

Presentations

Executive Briefing: Artificial intelligence—The next digital frontier? Session

After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we’re still early in the cycle of adoption. Louise Herring explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.

Mick Hollison leads Cloudera’s worldwide marketing efforts, including advertising, brand, communications, demand, partner, solutions, and web. Mick has had a successful 25-year career in enterprise and cloud software. Previously, he was CMO at sales acceleration and machine learning company InsideSales.com, helping the company pioneer a shift to data-driven marketing and sales that has served as a model for organizations around the globe; served as global vice president of marketing and strategy at Citrix, where he led the company’s push into the high-growth desktop virtualization market; managed executive marketing at Microsoft; and held numerous leadership positions at IBM Software. Mick is an advisory board member for InsideSales and a contributing author on Inc.com. He’s also an accomplished public speaker who has shared his insightful messages about the business impact of technology with audiences around the world. Mick holds a BS in management from the Georgia Institute of Technology.

Presentations

Charting a data journey to the cloud Keynote

What happens when you combine near-limitless data with on-demand access to powerful analytics and compute? For Deutsche Telekom, the results have been transformative. Mick Hollison, Sven Löffler, and Robert Neumann explain how Deutsche Telekom is harnessing machine learning and analytics in the cloud to build Europe’s largest and most powerful IoT data marketplace.

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it Session

Mick Hollison shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production—the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands—and outlines proven ways to meet them.

Shant Hovsepian is a cofounder and CTO of Arcadia Data, where he’s responsible for the company’s long-term innovation and technical direction. Previously, Shant was an early member of the engineering team at Teradata, which he joined through the acquisition of Aster Data, and he interned at Google, where he worked on optimizing the AdWords database. His experience includes everything from Linux kernel programming and database optimization to visualization. He started his first lemonade stand at the age of four and ran a small IT consulting business in high school. Shant studied computer science at UCLA, where he had publications in top-tier computer systems conferences.

Presentations

Ask Me Anything: Architecting a data platform for enterprise use Session

Join Mark Madsen and Shant Hovsepian to discuss analytics strategy and planning, data architecture, data management, and BI on big data.

Executive Briefing: BI on big data Session

If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian discuss the trade-offs between a number of architectures that provide self-service access to data.

Alison Howard is an assistant general counsel at Microsoft, where she supports the company’s efforts around the EU’s General Data Protection Regulation and provides privacy advice for Microsoft’s identity and security services and enterprise and developer products and online services. Previously, she practiced media and IP law in Seattle and was a journalist at newspapers in California and Idaho. Alison holds a JD from the University of California, Berkeley, an MA in international journalism from the University of Southern California, and a BA in journalism from the University of North Carolina at Chapel Hill. She also holds CIPP/E and CIPP/US certifications.

Presentations

Journey to GDPR compliance Keynote

May 25, the day the GDPR goes into effect, is an important milestone for data protection in the EU and elsewhere, but the journey to GDPR compliance neither begins nor ends there. Alison Howard explains how Microsoft, one of the world’s largest companies, with operations across the EU and around the globe, has prepared for May 25 and beyond.

Paul Ibberson is a senior architect within Teradata’s international ecosystem architecture CoE, where he helps leading organizations get the most value out of the latest advances in data warehousing and the big data landscape. Paul has worked with Teradata technology for over 20 years, covering most aspects of building data warehouses and the broader analytical ecosystem, and has worked with clients from the Nordic countries, Benelux, Germany, Austria, Australia, South Africa, and the UK. His previous roles and responsibilities at Teradata have included supporting pre- and postsales activities in the financial services, utilities, oil and gas, and government industries and leading teams to deliver greenfield customer implementations, from shaping requirements to designing, building, and implementing them as services. Paul is a frequent speaker at technical and customer conferences around the world. He holds a BSc in computer science and engineering from Manchester Metropolitan University, UK. He was awarded the Teradata Consulting Excellence Award in 2013.

Presentations

Driving better predictions in the oil and gas industry with modern data architecture DCS

Oil exploration and production is technically challenging, and exploiting the associated data brings its own difficulties. Jane McConnell and Paul Ibberson share best practices and lessons learned helping oil companies modernize their data architecture and plan the IT/OT convergence required to benefit from full digitalization.

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He’s a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he’s an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Solving data cleaning and unification using human-guided machine learning Session

Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.

Martha Imprialou is a data scientist and a specialist consultant in client projects at QuantumBlack, where she leads analytics work, designs and develops machine learning and statistical models, and works with pharma and healthcare clients, focusing on real-world patient outcomes to optimize commercial performance. Martha has extensive experience in healthcare analytics using genomics and biomedical and clinical data. Previously, Martha was a postdoctoral researcher at Imperial College London, where her research involved analyzing DNA sequencing data to understand the genetics of autoimmune diseases. She has published in five genetics and computer science journals, four of them as first author, and presented at 10+ conferences.

Presentations

Interpretable AI: Can we trust machine learning? Session

Konstantinos Georgatzis and Martha Imprialou explain how to interpret the predictions given by your black-box model and how machine learning is helping to drive decision making today.

Jean Innes is director of transformation and strategy at ASI. Before joining ASI, Jean was director of consumer data at Rightmove, the UK’s top online property search website, where she identified the objectives and then built a team to deliver the company’s first machine learning capability, delivering new commercial tools from terabytes of unstructured data. Jean also worked at Amazon as head of commercial relationships in the UK retail business and has experience in the public sector at HM Treasury. Jean is an advisor to the board of HouseMark, with a focus on data and technology.

Presentations

Data science for managers 2-Day Training

Jean Innes, Matthew Ward, Emanuele Haerens, and Alli Paget lead a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Data science for managers (Day 2) Training Day 2

The instructors offer a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Rashed Iqbal is a program manager for data solutions at Teledyne Technologies in California as well as an adjunct professor in the Economics Department at UCLA, where he teaches graduate courses in data science. Rashed also teaches a course on deep learning and natural language processing and understanding at UC Irvine. His current area of research is narrative economics, which studies the impact of popular narratives and stories on economic fluctuations. He believes narrative extraction will revolutionize the process of human communication. His other areas of interest and expertise include data science, machine learning, and transitioning traditional organizations to Agile and Lean methods. Rashed has led multiple entrepreneurial ventures in these areas. He holds a PhD in systems engineering with a focus on stochastic and predictive systems as well as current CSM, CSP, PMI-ACP, and PMP certifications. He is a senior member of the IEEE.

Presentations

Narrative extraction: Analyzing the world’s narratives through natural language understanding Session

Narratives are significant vectors of rapid change in culture, economic behavior, and the Zeitgeist of a society. Narrative economics studies the impact of popular human-interest stories on economic fluctuations. Naveed Ghaffar and Rashed Iqbal outline a framework that uses natural language understanding to extract and analyze narratives in human communication.

Kinnary Jangla is a senior software engineer on the home feed team at Pinterest, where she works on the machine learning infrastructure team as a backend engineer. Kinnary has worked in the industry for 10+ years. Previously, she worked on maps and international growth at Uber and on Bing search at Microsoft. She is the author of two books. Kinnary holds an MS in computer science from the University of Illinois and a BE from the University of Mumbai.

Presentations

Accelerating development velocity of production ML systems with Docker Session

Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of its ML teams while increasing uptime and ease of deployment.

Jeroen Janssens is the founder, CEO, and an instructor of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He’s the author of Data Science at the Command Line (O’Reilly). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Presentations

50 reasons to learn the shell for doing data science Session

"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.

Dan Jeavons is general manager of Shell’s Advanced Analytics Center of Excellence within the company’s central CIO Office, where he is part of a small team tasked with developing business architecture across the group, with specific responsibilities for overseeing the implementation of a single enterprise process model. Previously, Dan was the consulting and training lead for the newly formed Process Design Centre of Excellence, where he was responsible for rolling out a standard approach for business process analysis and design across the group, as well as integrating this approach with other disciplines such as process improvement and data architecture. Prior to Shell, Dan was at Accenture, where he focused on standardizing processes from procurement through to credit risk management, gaining experience in the oil and gas business. He graduated from Oxford University with first-class honors.

Presentations

Analytics-driven insights are the new oil: How Shell is transforming data science DCS

As the world moves to a lower hydrocarbon future, Shell is using data to drive its own organizational change. Analytics and the transformation of a skilled analyst workforce into the next generation of citizen data scientists are the keys to Shell’s growth. Dan Jeavons details Shell’s journey to data analytics excellence.

Flavio Junqueira is senior director of software engineering at Dell EMC, where he leads the Pravega team. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, Flavio held an engineering position with Confluent and research positions with Yahoo Research and Microsoft Research. He is an active contributor to Apache projects, including Apache ZooKeeper (as PMC and committer), Apache BookKeeper (as PMC and committer), and Apache Kafka. Flavio coauthored the O’Reilly ZooKeeper book. He holds a PhD in computer science from the University of California, San Diego.

Presentations

Stream scaling in Pravega Session

Stream processing is in the spotlight. Enabling low-latency insights and actions out of continuously generated data is compelling to a number of application domains, and the ability to adapt to workload variations is critical to many applications. Flavio Junqueira explores Pravega, a stream store that scales streams automatically and enables applications to scale downstream by signaling changes.

Eva Kaili is a member of the European Parliament and the chair of the European Parliament’s Science and Technology Options Assessment body (STOA), where she has been working intensively on promoting innovation as a driving force of the establishment of the European digital single market. Eva has been particularly active in the fields of blockchain technology, m- and e-health, big data, fintech, and cybersecurity as well as taxation, where she has been the rapporteur of the ECON committee’s annual tax report. As a member of the ECON committee, she has been focusing on the EU’s financial integration and the management of the financial crisis in the Eurozone. Previously, Eva was twice elected to the Greek Parliament, where she served between 2007 and 2012, with the PanHellenic Socialist Movement (PASOK). She holds a bachelor’s degree in architecture and civil engineering and a postgraduate degree in European politics. Currently, she is working toward a PhD in international political economy.

Presentations

Data protection and innovation Keynote

Keynote with Eva Kaili

Mirko Kämpf is a solutions architect on the CEMEA team at Cloudera, where he applies tools from the Hadoop ecosystem, such as Spark, HBase, and Solr, to solve customers’ problems and is working on graph-based knowledge representation using Apache Jena to enable semantic search at scale. Mirko’s research focuses on time-dependent networks and time series analysis at scale. He loves to deliver data-centric workshops and has spoken at several big data-related conferences and meetups. He holds a PhD in statistical physics.

Presentations

Improving computer vision models at scale Session

Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable.

Amit Kapoor is a data storyteller at narrativeViz, where he uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Interested in learning and teaching the craft of telling visual stories with data, Amit also teaches storytelling with data for executive courses as a guest faculty member at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. Previously, he gained more than 12 years of management consulting experience with A.T. Kearney in India, Booz & Company in Europe, and startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad.

Presentations

Architectural design for interactive visualization Session

Creating visualizations for data science requires an interactive setup that works at scale. Bargava Subramanian and Amit Kapoor explore the key architectural design considerations for such a system and discuss the four key trade-offs in this design space: rendering for data scale, computation for interaction speed, adapting to data complexity, and being responsive to data velocity.

Deep learning in the browser: Explorable explanations, model inference, and rapid prototyping Session

Amit Kapoor and Bargava Subramanian lead three live demos of deep learning (DL) done in the browser—building explorable explanations to aid insight, building model inference applications, and rapid prototyping and training an ML model—using the emerging client-side JavaScript libraries for DL.

Manas Ranjan Kar is an associate vice president at the US healthcare company Episource, where he leads the NLP and data science practice, works on semantic technologies and computational linguistics (NLP), builds algorithms and machine learning models, researches data science journals, and architects secure product backends in the cloud. He’s architected multiple commercial NLP solutions in the areas of healthcare, food and beverages, finance, and retail. Manas is deeply involved in functionally architecting large-scale business process automation and deep insights from structured and unstructured data using NLP and ML. He’s contributed to NLP libraries including gensim and Conceptnet 5 and blogs regularly about NLP on forums like Data Science Central, LinkedIn, and his blog Unlock Text. Manas speaks regularly about NLP and text analytics at conferences and meetups, such as PyCon India and PyData, has taught hands-on sessions at IIM Lucknow and MDI Gurgaon, and has mentored students from schools including ISB Hyderabad, BITS Pilani, and the Madras School of Economics. When bored, he falls back on Asimov to lead him into an alternate reality.

Presentations

Building a healthcare decision support system for ICD10/HCC coding through deep learning Session

Episource is building a scalable NLP engine to help summarize medical charts and extract medical coding opportunities and their dependencies to recommend best possible ICD10 codes. Manas Ranjan Kar offers an overview of the wide variety of deep learning algorithms involved and the complex in-house training-data creation exercises that were required to make it work.

Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads.

Ilia Karmanov is a data scientist working on applying machine learning and deep learning solutions in industry. He is particularly interested in the statistical theory behind deep learning. Ilia holds an MSc in economics from the London School of Economics.

Presentations

Distributed training of deep learning models Session

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

Geordie Kaytes is the director of UX strategy for Boston-area UI/UX studio Fresh Tilled Soil and a partner at Heroic, a design leadership coaching firm that helps growing companies scale their digital product capabilities. A digital product design leader with deep experience in design process transformation and cross-functional expertise in design, strategy, and technology, Geordie has helped companies in a broad range of industries develop a 360-degree view of their product design processes. Previously, he did his obligatory tour of duty in management consulting. He holds a BA from Yale in political science. He is a coauthor of the Medium publication Radical Product.

Presentations

Measure what matters: How your measurement strategy can reduce opex Tutorial

These days it’s easy for companies to say, "We measure everything!" The problem is, most popular metrics may not be appropriate or relevant for your business. Measurement isn’t free and should be done strategically. Radhika Dutt, Geordie Kaytes, and Nidhi Aggarwal explain how to align measurement with your product strategy so you can measure what matters for your business.

Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers working on novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns; his team also built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Before that, he developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Correlation analysis on live data streams Session

The rate of growth of data volume and velocity has been accelerating along with increases in the variety of data sources. This poses a significant challenge to extracting actionable insights in a timely fashion. Arun Kejariwal and Francois Orsini explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.

Modern real-time streaming architectures Tutorial

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. Karthik Ramasamy, Arun Kejariwal, and Ivan Kelly walk you through state-of-the-art streaming frameworks, algorithms, and architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Ivan Kelly is a software engineer at Streamlio, a startup dedicated to providing a next-generation integrated real-time stream processing solution, based on Heron, Apache Pulsar (incubating), and Apache BookKeeper. Ivan has been active in Apache BookKeeper since its very early days as a project in Yahoo! Research Barcelona. Specializing in replicated logging and transaction processing, he is currently focused on Streamlio’s storage layer.

Presentations

Modern real-time streaming architectures Tutorial

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. Karthik Ramasamy, Arun Kejariwal, and Ivan Kelly walk you through state-of-the-art streaming frameworks, algorithms, and architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Multi-data center and multitenant durable messaging with Apache Pulsar Session

Ivan Kelly offers an overview of Apache Pulsar, a durable, distributed messaging system, underpinned by Apache BookKeeper, that provides the enterprise features necessary to guarantee that your data is where it should be and only accessible by those who should have access.

Presentations

Operationalizing live data to benefit business (sponsored by WANdisco) Session

Today, every company is a data company. Business success depends on putting large volumes of live data to work to drive competitive advantage. Paul Phillips details how some of the world’s largest companies have achieved 100% uptime while moving massive live datasets and halving their hardware requirements.

Dooyong Kim is a software engineer at Naver, where he has been working on building a Spark- and Druid-based OLAP platform. Previously, he was a search engineer at ecommerce search platform Coupang, where he implemented several Apache Solr search infrastructure-related projects and researched a Spark and Solr integrated indexing mechanism. Dooyong is currently interested in MPP and advanced file formats for big data processing.

Presentations

Web analytics at scale with Druid at Naver Session

Naver.com is the largest search engine in Korea, with a 70% share of the Korean search market, and it handles billions of pages and events every day. Jason Heo and Dooyong Kim offer an overview of Naver's web analytics system, built with Druid.

Eugene Kirpichov is a staff software engineer on the Cloud Dataflow team at Google, where he works on the Apache Beam programming model and APIs. Previously, Eugene worked on Cloud Dataflow’s autoscaling and straggler elimination techniques. He is interested in programming language theory, data visualization, and machine learning.

Presentations

Radically modular data ingestion APIs in Apache Beam Session

Apache Beam offers users a novel programming model in which the classic batch-streaming dichotomy is erased, and it ships with a rich set of I/O connectors to popular storage systems. Eugene Kirpichov explains why Beam has made these connectors flexible and modular—a key component of which is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming.

Kostas Kloudas is a software engineer at data Artisans and a Flink committer working to make Apache Flink the best open source stream processing engine (and your data’s best friend). Previously, Kostas was a postdoctoral researcher at IST in Lisbon. He holds a PhD in computer science from Inria in France and an engineering diploma from NTUA in Athens, where his main research focus was cloud storage and distributed processing.

Presentations

Complex event processing with Apache Flink Session

Complex event processing (CEP) helps detect patterns over continuous streams of data. DNA sequencing, fraud detection, shipment tracking with specific characteristics (e.g., contaminated goods), and user activity analysis fall into this category. Kostas Kloudas offers an overview of Flink's CEP library and explains the benefits of its integration with Flink.

Jorie Koster-Hale is a lead data scientist at Dataiku with expertise in neuroscience, healthcare data, and machine learning. Previously, she was a postdoctoral fellow at Harvard. Jorie holds a PhD in cognitive neuroscience from the Massachusetts Institute of Technology. She currently resides in Paris, where she builds predictive models and eats pain au chocolat.

Presentations

Rent, rain, and regulations: Leveraging structure in big data to predict criminal activity Session

Because crime is affected by a number of geospatial and temporal features, predicting crime poses a unique technical challenge. Jorie Koster-Hale shares an approach using a combination of open source data, machine learning, time series modeling, and geostatistics to determine where crime will occur, what predicts it, and what we can do to prevent it in the future.

Aljoscha Krettek is a cofounder and software engineer at Ververica. Previously, he worked at IBM Germany and at the IBM Almaden Research Center in San Jose. Aljoscha is a PMC member at Apache Beam and Apache Flink, where he mainly works on the streaming API and designed and implemented the most recent additions to the windowing and state APIs. He studied computer science at TU Berlin.

Presentations

Stream processing for the practitioner: Blueprints for common stream processing use cases with Apache Flink Session

Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes ("application blueprints") for stream processing drawn from real-world use cases with Apache Flink.

Jan Kunigk has worked on enterprise Hadoop solutions since 2010. Before joining Cloudera in 2014, his tasks included building optimized systems architectures for Hadoop at IBM and implementing a Hadoop-as-a-service offering at T-Systems. In his current role as a solutions architect, he makes Hadoop projects at Cloudera’s enterprise customers successful, covering a wide spectrum from architectural decisions to the implementation of big data applications across all industry sectors on a day-to-day basis.

Presentations

Improving computer vision models at scale Session

Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable.

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. He specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. In addition to his client-facing consulting and training, Jared is an adjunct professor of statistics at Columbia University and the organizer of the New York Open Statistical Programming Meetup and the New York R Conference. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world and was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master’s degree in statistics from Columbia University and a bachelor’s degree in mathematics from Muhlenberg College.

Presentations

Modeling time series in R Session

Temporal data is being produced in ever-greater quantity, but fortunately our time series capabilities are keeping pace. Jared Lander explores techniques for modeling time series, from traditional methods such as ARMA to more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way, Jared shares theory and code for training these models.

Martha Lane Fox is founder and executive chair of Doteveryone.org.uk, a charity fighting for a fairer internet. Over her career, Martha has served as a nonexecutive director of Twitter, a cross-bench peer in the UK House of Lords, a member of the Joint Committee on National Security Strategy, chancellor of the Open University, and nonexecutive director of the Baileys Women’s Prize for Fiction and the Scale Up Institute. She was digital champion for the UK and helped to create the Government Digital Service. She also cofounded and chairs LuckyVoice, a company revolutionizing the karaoke industry, and cofounded Lastminute.com, Europe’s largest travel and leisure website, with Brent Hoberman. Martha is a patron of AbilityNet, Reprieve, Camfed, and Just for Kids Law. She has been awarded a CBE.

Presentations

The good, the bad, and the internet? Keynote

Keynote with Martha Lane Fox

Michael Lanzetta is a principal SDE at Microsoft. In his more than 20-year career in the software industry, he’s worked in domains as varied as circuit design and drug discovery and in languages from JavaScript to F#, but his primary focus has always been scaled-out server-side work. Michael has a background in demand forecasting from Manugistics and Amazon and machine learning from Bing; he has spent the last few years building intelligent services on Azure using everything from Spark to TensorFlow and CNTK.

Presentations

Detecting small-scale mines in Ghana Session

Michael Lanzetta and Elena Terenzi offer an overview of a collaboration between Microsoft and the Royal Holloway University that applied deep learning to locate illegal small-scale mines in Ghana using satellite imagery, scaled training using Kubernetes, and investigated the mines' impact on surrounding populations and environment.

Paul Lashmet is practice lead and advisor for financial services at Arcadia Data, a company that provides visual big data analytics software that empowers business users to glean meaningful and real-time business insights from high-volume and varied data in a timely, secure, and collaborative way. Paul writes extensively about the practical applications of emerging and innovative technologies to regulatory compliance. Previously, he led programs at HSBC, Deutsche Bank, and Fannie Mae.

Presentations

Real-time trade surveillance is not just about trade data Findata

To fully demonstrate that policies and procedures comply with regulatory requirements, financial services organizations must use information that goes beyond traditional data sources. Paul Lashmet explains how alternative data sources enhance trade surveillance by providing a deeper understanding of the intent of trade activities.

Francesca Lazzeri is a senior machine learning scientist at Microsoft on the cloud advocacy team and an expert in big data technology innovations and the applications of machine learning-based solutions to real-world problems. Her research has spanned the areas of machine learning, statistical modeling, time series econometrics and forecasting, and a range of industries—energy, oil and gas, retail, aerospace, healthcare, and professional services. Previously, she was a research fellow in business economics at Harvard Business School, where she performed statistical and econometric analysis within the technology and operations management unit. At Harvard, she worked on multiple patent, publication, and social network data-driven projects to investigate and measure the impact of external knowledge networks on companies’ competitiveness and innovation. Francesca periodically teaches applied analytics and machine learning classes at universities and research institutions around the world. She’s a data science mentor for PhD and postdoc students at the Massachusetts Institute of Technology and a speaker at academic and industry conferences, where she shares her knowledge and passion for AI, machine learning, and coding.

Presentations

Operationalize deep learning models for fraud detection with Azure Machine Learning Workbench Session

Advancements in computing technologies and ecommerce platforms have amplified the risk of online fraud, which results in billions of dollars of loss for the financial industry. This trend has urged companies to consider AI techniques, including deep learning, for fraud detection. Francesca Lazzeri and Jaya Mathew explain how to operationalize deep learning models with Azure ML to prevent fraud.

Guy Leroux is product manager for data lake solutions at Atos. He is passionate about providing better solutions for IT organizations, particularly for security and privacy.

Presentations

ClickFox: Customer journey analytics powered by OpenStack and Cloudera Session

Alvin Heib and Guy Leroux offer an overview of ClickFox, a platform able to cope with high-performance analytical needs, from bits and bytes to solving customer needs, covering the platform's virtualization, big data, and analytical layers.

Randy Lea is chief revenue officer at Arcadia Data, where he is charged with leading the company’s sales momentum. Randy is passionate about solving customer problems by leveraging analytics and data. An early participant in the data warehouse and BI analytics market, he has held leadership positions at companies including Aster Data, Think Big Analytics, and Teradata. Randy holds a bachelor’s degree in marketing from California State University, Fullerton.

Presentations

A tale of two BI standards: Data warehouses and data lakes (sponsored by Arcadia Data) Session

Business intelligence (BI) and analytics on data lakes have had limited success. Data lakes often fall short because they are mostly used by data scientists and not by business users. Randy Lea explains why existing BI tools work well for data warehouses but not data lakes and why modern BI tools designed for data lakes should represent the second BI standard in enterprises today.

Mike Lee Williams is a research engineer at Cloudera Fast Forward Labs, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Cloudera’s customers understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Interpretable machine learning products Session

Interpretable models result in more accurate, safer, and more profitable machine learning products, but interpretability can be hard to ensure. Mike Lee Williams examines the growing business case for interpretability, explores concrete applications including churn, finance, and healthcare, and demonstrates the use of LIME, an open source, model-agnostic tool you can apply to your models today.

Jonathan Leslie is the head of data science at Pivigo, where he works with business partners to develop data science solutions that make the most of their data, including in-depth analysis of existing data and predictive analytics for future business needs. He also programs, mentors, and manages teams of data scientists on projects in a wide variety of business domains.

Presentations

Predicting rent arrears: Leveraging data science in the public sector Session

One major challenge in social housing is determining how best to target interventions when tenants fall behind on rent payments. Jonathan Leslie, Maryam Qurashi, and Tom Harrison discuss a recent project in which a team of data science trainees helped Hackney Council devise a more efficient, targeted strategy to detect and prioritize such situations.

Revolutionizing the newsroom with artificial intelligence Session

In the era of 24-hour news and online newspapers, editors in the newsroom must quickly and efficiently make sense of the enormous amounts of data that they encounter and make decisions about their content. Daniel Gilbert and Jonathan Leslie discuss an ongoing partnership between News UK and Pivigo in which a team of data science trainees helped develop an AI platform to assist in this task.

Federico Leven is the founder and lead data architect at ReactoData, a startup located in Buenos Aires, Argentina, and Wroclaw, Poland, focused on big data, advanced analytics applications, and Hadoop. He also participates in the Open Compute Project doing benchmarks of big data frameworks, coordinates the big data meetups at IAAR, teaches a hands-on Hadoop lab at the Universidad De Palermo in Argentina, and is a frequent speaker at big data conferences (when he has a good idea to share). He began working with Hadoop in 2012 at Luminar Insights; previously, he was a data warehouse architect and Python developer.

Presentations

Hadoop under attack: Securing data in a banking domain Session

The apparent difficulty of managing Hadoop compared to more traditional and proprietary data products makes some companies wary of the Hadoop ecosystem, but managing security is becoming more accessible in the Hadoop space, particularly in the Cloudera stack. Federico Leven offers an overview of an end-to-end security deployment on Hadoop and the data and security governance policies implemented.

Tianhui Michael Li is the founder and president of the Data Incubator, a data science training and placement firm. Michael bootstrapped the company and navigated it to a successful sale to the Pragmatic Institute. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, JPMorgan, and D.E. Shaw. He’s a regular contributor to the Wall Street Journal, TechCrunch, Wired, Fast Company, Harvard Business Review, MIT Sloan Management Review, Entrepreneur, VentureBeat, TechTarget, and O’Reilly. Michael holds a PhD from Princeton and was a postdoc at Cornell and a Marshall Scholar at Cambridge.

Presentations

Data, AI, and innovation in the enterprise Session

What are the latest initiatives and use cases around data and AI? How are data and AI reshaping industries? How do we foster a culture of data and innovation within a larger enterprise? What are some of the challenges of implementing AI within the enterprise setting? Michael Li moderates a panel of experts in different industries to answer these questions and more.

Ryan Lippert works in product marketing at Google, where he is responsible for developing and communicating Google’s vision for big data and analytics. Previously, Ryan served in a variety of roles at Cisco Systems and Cloudera. He holds an economics degree from the University of Guelph and an MBA from Stanford.

Presentations

Building the bridge from big data to machine learning and artificial intelligence (sponsored by Google Cloud) Session

If your company isn’t good at analytics, it’s not ready for AI. Ryan Lippert explains how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value.

Audrey Lobo-Pulo is the founder of Phoensight and has a passion for using emerging data technologies to empower individuals, governments, and organizations in creating a better society. Audrey has over 10 years’ experience working with the Australian Treasury in public policy areas including personal taxation, housing, social policy, labor markets, and population demographics. She’s an open government advocate and has a passion for open data and open models. She pioneered the concept of “government open source models,” which are government policy models open to the public to use, modify, and distribute freely. Audrey’s deeply interested in how technology enables citizens to actively participate and engage with their governments in cocreating public policy. She holds a PhD in physics and a master’s in economic policy.

Presentations

Leveraging public-private partnerships using data analytics for economic insights Session

In October 2017, LinkedIn and the Australian Treasury teamed up to gain a deeper understanding of the Australian labor market through new data insights, which may inform economic policy and directly benefit society. Audrey Lobo-Pulo and Nick O'Donnell share some of the discoveries from this collaboration as well as the practicalities of working in a public-private partnership.

Mathew Lodge is senior vice president of product and marketing at Anaconda. Mathew has well over 20 years’ diverse experience in cloud computing and product leadership. Previously, he was chief operating officer at container and microservices networking and management startup Weaveworks; vice president of VMware’s Cloud Services Group and cofounder of what became VMware’s vCloud Air IaaS service; and senior director of Symantec’s $1B+ Information Management Group. Early in his career, Mathew built compilers and distributed systems for projects like the International Space Station, helped connect six countries to the internet for the first time, managed a $630M router product line at Cisco, and attempted to do SDN 10 years too early at CPlane.

Presentations

Cloud-native data science with Anaconda, Docker, and Kubernetes (sponsored by Anaconda) Session

The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Mathew Lodge demonstrates that it's just as easy to deploy Python as it is Java, using containers and Kubernetes. Welcome to the future.

Sven Löffler is a business development executive for big data analytics at T-Systems, where he is responsible for identification and development of data analytics and data-driven solutions. Previously, he was an executive IT specialist (OpenGroup certified) and business development leader for IBM Watson and big data solutions. Over his 20-year career, he has held a number of sales support positions in Germany and Europe and has proven and extensive experience in the business intelligence, performance management, big data marketing and support services, technical sales, and marketing spaces.

Presentations

Charting a data journey to the cloud Keynote

What happens when you combine near-limitless data with on-demand access to powerful analytics and compute? For Deutsche Telekom, the results have been transformative. Mick Hollison, Sven Löffler, and Robert Neumann explain how Deutsche Telekom is harnessing machine learning and analytics in the cloud to build Europe’s largest and most powerful IoT data marketplace.

The Data Intelligence Hub: On-demand Hadoop resource provisioning in Europe’s Industrial Data Space using Cloudera Altus Session

Sven Löffler offers an overview of the Data Intelligence Hub, T-Systems's implementation of the Fraunhofer Industrial Data Space: a reference architecture for the standardized and secure data exchange between industries in the context of the internet of things.

Ben Lorica is the chief data scientist at O’Reilly. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Building a stronger data ecosystem Keynote

To enable the machine learning applications of the future, there remain many interesting and challenging data problems we need to tackle as a community. Ben Lorica discusses some of the pressing problems we're facing as we collect and store data, particularly in an era when our machine learning models require huge amounts of labeled data.

Thursday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Hollie Lubbock is an interaction design manager at Fjord, where she focuses on helping clients such as Google, Facebook, Net-a-Porter, the BBC, Southbank Centre, the V&A, Paul Smith, Wellcome Collection, the National Theatre, DeepMind, and Roald Dahl create products and services that excite audiences and drive engagement. Over the last 10 years, she has worked in the luxury, culture and publishing, and telco sectors, collaborating closely with clients to create user-centered designs for a wide range of digital products, from large-scale collections systems to small innovative app projects. Hollie also mentors early-stage startups as part of the Google Launchpad program and junior UX designers through UXPA. When not designing, she’s a keen traveler and Instagram addict.

Presentations

Designing ethical artificial intelligence Session

Artificial intelligence systems are powerful agents of change in our society, but as this technology becomes increasingly prevalent—transforming our understanding of ourselves and our society—issues around ethics and regulation will arise. Jivan Virdee and Hollie Lubbock explore how to address fairness, accountability, and the long-term effects on our society when designing with data.

Humanizing data: How to find the why DCS

Data has opened up huge possibilities for analyzing and customizing services. However, although we can now manage experiences to dynamically target audiences and respond immediately, context is often missing. Hollie Lubbock and Jivan Virdee share a practical approach to discovering the reasons behind the data patterns you see and help you decide what level of personalized service to create.

Boris Lublinsky is a principal architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Ask Me Anything: Streaming applications and architectures Session

Join Dean Wampler and Boris Lublinsky to discuss all things streaming: architecture, implementation, streaming engines and frameworks, techniques for serving machine learning models in production, traditional big data systems (dying or still relevant?), and general software architecture and data systems.

Kafka streaming microservices with Akka Streams and Kafka Streams Tutorial

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Along the way, Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Paul Lynn leads the product engineering team at Nordea, where he and his team are helping deliver Nordea’s next-generation data services and data platform. Previously, Paul spent nine years working at investment banks in London. Paul holds an MA in psychology from the University of Dublin, Trinity College, and an MPhil in social and development psychology from the University of Cambridge. Outside of work, he supports an educational charity, plays cello, and goes sailing.

Presentations

How Nordea reduced time to market by 85% with modern analytics Findata

As a global systemically important bank (G-SIB), Nordea Bank is subject to the highest level of regulatory oversight in the financial services industry. Paul Damien Lynn explains how implementing new technology and processes for acquiring, preparing, and analyzing data helped reduce the bank’s compliance reporting processes by over 85%.

Gerard Maas is a senior software engineer at Lightbend, where he contributes to the Fast Data Platform and focuses on the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He is the coauthor of Stream Processing with Apache Spark from O’Reilly. Gerard is a frequent speaker and contributes to small and large open source projects. In his free time, he tinkers with drones and builds personal IoT projects.

Presentations

Processing fast data with Apache Spark: A tale of two APIs Session

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences in key aspects of a streaming application, from the API user experience to dealing with time and with state and machine learning capabilities, and shares practical guidance on picking one or combining both to implement resilient streaming pipelines.

Mark Madsen is a Fellow at Teradata, where he’s responsible for understanding, forecasting, and defining analytics ecosystems and architectures. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning, and vendors on product management. Mark has designed analysis, machine learning, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Ask Me Anything: Architecting a data platform for enterprise use Session

Join Mark Madsen and Shant Hovsepian to discuss analytics strategy and planning, data architecture, data management, and BI on big data.

Executive Briefing: BI on big data Session

If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian discuss the trade-offs between a number of architectures that provide self-service access to data.

Steffen Maerkl is a systems engineer at Cloudera, where he is part of the global security and data governance specialization team supporting customers across the central EMEA region, with a strong focus on the automotive, manufacturing, and telco markets. Steffen has held a number of consulting and presales positions in the fields of data warehousing, business analytics, and big data at companies such as Cirquent/NTT Data and Oracle. He holds a BSc in business informatics from the Technical University of Munich.

Presentations

Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Tutorial

Hybrid big data deployments present significant new security risks. Security admins must ensure a consistently secured and governed experience for end users and administrators across multiple workloads. Mark Donsky, Steffen Maerkl, and André Araujo share best practices for meeting these challenges as they walk you through securing a Hadoop cluster.

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Big data at speed Session

Many details go into building a big data system for speed, from determining a respectable latency for data access and where to store the data to solving multiregion problems—or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed.

Baolong Mao is a big data platform development engineer at JD.com, where he works on the company’s big data platform and focuses on the big data ecosystem. He is an open source developer, an Alluxio PMC member and contributor, and a Hadoop contributor. He’s a fan of technology sharing and open source.

Presentations

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Session

Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, the JDPresto framework has seen a 10x performance improvement on average.

Dana Mastropole is a data scientist in residence at the Data Incubator and contributes to curriculum development and instruction. Previously, Dana taught elementary school science after completing MIT’s Kaufman teaching certificate program. She studied physics as an undergraduate student at Georgetown University and holds a master’s in physical oceanography from MIT.

Presentations

Machine learning with TensorFlow 2-Day Training

The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. Dana Mastropole details TensorFlow's capabilities through its Python interface.

Machine learning with TensorFlow (Day 2) Training Day 2

The TensorFlow library provides for the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. Dana Mastropole details TensorFlow's capabilities through its Python interface.

Jaya Mathew is a senior data scientist on the artificial intelligence and research team at Microsoft, where she focuses on the deployment of AI and ML solutions to solve real business problems for customers in multiple domains. Previously, she worked on analytics and machine learning at Nokia and Hewlett Packard Enterprise. Jaya holds an undergraduate degree in mathematics and a graduate degree in statistics from the University of Texas at Austin.

Presentations

Operationalize deep learning models for fraud detection with Azure Machine Learning Workbench Session

Advancements in computing technologies and ecommerce platforms have amplified the risk of online fraud, which results in billions of dollars of loss for the financial industry. This trend has urged companies to consider AI techniques, including deep learning, for fraud detection. Francesca Lazzeri and Jaya Mathew explain how to operationalize deep learning models with Azure ML to prevent fraud.

Jane McConnell is a practice partner for oil and gas within Teradata’s Industrial IoT Group, where she shows oil and gas clients how analytics can provide strategic advantage and business benefits in the multimillions. Jane is also a member of Teradata’s IoT core team, where she sets the strategy and positioning for Teradata’s IoT offerings and works closely with Teradata Labs to influence development of products and services for the industrial space. Originally from an IT background, Jane has also done time with dominant market players such as Landmark and Schlumberger in R&D, product management, consulting, and sales. In one role or another, she has influenced information management projects for most major oil companies across Europe. She chaired the education committee for the European oil industry data management group ECIM, has written for Forbes, and regularly presents internationally at oil industry events. Jane holds a BEng in information systems engineering from Heriot-Watt University in the UK. She is Scottish and has a stereotypical love of single malt whisky.

Presentations

Driving better predictions in the oil and gas industry with modern data architecture DCS

Oil exploration and production is technically challenging, and exploiting the associated data brings its own difficulties. Jane McConnell and Paul Ibberson share best practices and lessons learned helping oil companies modernize their data architecture and plan the IT/OT convergence required to benefit from full digitalization.

Jude McCorry is director of business development at the Data Lab in Scotland, where she is responsible for delivering collaborative data science projects between industry and academia. Previously, Jude worked in the public sector on Data for Good projects like delayed discharges and safe homes. Jude has over 15 years’ experience in sales and marketing in the technology sector and has worked with B2B and public sector clients for companies like Dell, Firefly Communications, and Xnet Data Storage. Jude also worked at Edinburgh Napier University, where she set up the school’s commercial arm, the Edinburgh Institute, which provides leading-edge practice-based executive education to Scottish executives.

Presentations

Data Collaboratives Session

Jude McCorry and Mahmood Adil offer an overview of Data Collaboratives, a new form of collaboration beyond the public-private partnership model, in which participants from different sectors exchange data, skills, leadership, and knowledge to solve complex problems facing children in Scotland and worldwide.

Patrick McFadin is the vice president of developer relations at DataStax, where he leads a team devoted to making users of DataStax products successful. Previously, he was chief evangelist for Apache Cassandra and a consultant for DataStax, where he helped build some of the largest and most exciting deployments in production; a chief architect at Hobsons; and an Oracle DBA and developer for over 15 years.

Presentations

Time for a new relation: Going from RDBMS to a graph database Session

Graph databases are becoming mainstream. Patrick McFadin explains how to use the knowledge you have gained from your years of working with relational databases in this brave new world. There are many similarities but also some significant differences that can open up completely new use cases. If you're deciding whether to take the plunge into graph databases, this is the talk for you.

Shaun McGirr is the lead data scientist at Cox Automotive Data Solutions, where he spends most days developing new data products for the automotive industry. Shaun has been working with data in one way or another for about 15 years. He holds a PhD from the University of Michigan.

Presentations

Scaling data science (teams and technologies) Session

Cox Automotive is the world’s largest automotive service organization, which means it can combine data from across the entire vehicle lifecycle. Cox is on a journey to turn this data into insights. David Asboth and Shaun McGirr share their experience building up a data science team at Cox and scaling the company's data science process from laptop to Hadoop cluster.

Viola Melis is a data scientist at Typeform, an online platform for conversational data collection. Passionate about data analytics and problem solving, Viola has spent the last two years focusing on business problems such as pricing, defining marketing personas, and understanding user behavior. She studied mathematical engineering and statistics at the Politecnico di Milano.

Presentations

How Typeform's data and analytics team managed to embed its data scientists into cross-functional teams while maintaining their cohesion DCS

Typeform's data team is transitioning into a less centralized structure and embedding its data scientists inside product and business teams. Viola Melis details initiatives the team developed to ensure alignment and cohesion, discusses the journey through this challenging process, and shares lessons learned, best practices, and new processes that were established.

Alexander Melkonyan is a data engineer at BMW focusing on Hadoop architecture, distributed systems, Spark, and other components of the ecosystem. Previously, he was a big data engineer at a number of cloud service and IoT companies, where he worked on establishing Hadoop as the main data platform.

Presentations

Enabling data-driven development for autonomous driving at BMW (sponsored by BMW) Session

The development of autonomous driving cars requires the handling of huge amounts of data produced by test vehicles and solving a number of critical challenges specific to the automotive industry. Miha Pelko and Aleksandr Melkonyan outline these challenges and explain how BMW is overcoming them by adapting and reinventing existing big data solutions for autonomous driving.

Miriah Meyer is an associate professor in the School of Computing at the University of Utah, where she runs the Visualization Design Lab. Her research focuses on the design of visualization systems for helping analysts and researchers make sense of complex data. Miriah was named a University of Utah distinguished alumni, a TED fellow, and a PopTech science fellow and has been included on MIT Technology Review’s TR35 list of the top young innovators.

Presentations

Making data visual: A practical session on using visualization for insight Tutorial

Danyel Fisher and Miriah Meyer explore the human side of data analysis and visualization, covering operationalization, the process of reducing vague problems to specific tasks, and how to choose a visual representation that addresses those tasks. Along the way, they also discuss single views and explain how to link them into multiple views.

Grigorios Mingas is a data scientist at easyJet, where he applies machine learning and statistical techniques to tackle a wide variety of business problems and drive profits and savings. He enjoys working with customers and contributing to the evolution and growth of his team and has an interdisciplinary background in Bayesian modeling and parallel computing. Grigorios holds a PhD in electronic and electrical engineering from Imperial College.

Presentations

Data science survival and growth within the corporate jungle: An easyJet case study Session

Because in-house data science teams work with a range of business functions, traditional data science processes are often too abstract to cope with the complexity of these environments. Alberto Rey Villaverde and Grigorios Mingas share case studies from easyJet that highlight some unpredictable hurdles related to requirements, data, infrastructure, and deployment and explain how they solved them.

Angelique Mohring is the founder and CEO of GainX, which leverages AI and advanced network design theory to accelerate global transformation for business. Angelique believes the world has entered a state of digital and economic transformation the like of which we’ve never experienced before and for which we are gravely unprepared. Her combined 25 years of experience as an anthropologist and bioarchaeologist, Fortune 500 consultant, and technology executive has given her a deep understanding of how organizations must drive sustainable growth through innovation and transformation. Angelique is a sought-after speaker in the US, Canada, and Europe on topics that include AI and machine learning, the future of the enterprise, collaborative innovation, global cultural and digital transformation, the future of work, and the unspoken dependencies across economic ecosystems. She has spoken for companies and events including the Financial Times, IBM, Goldman Sachs, Google, the Telegraph, Silicon Valley, Microsoft, RBS, Money 2020, Pandemonio, MaRS Verge, Finovate Europe, Dublin’s UpRise, Finovate NYC, Banking Disrupted, and the Future of Work, as well as several academic institutions. Angelique is politically active, helping individuals understand why they matter in the next economy, how corporations must navigate the tsunami of change in the coming decade, and the roles governments must play in economic stability.

Presentations

Using data flow and machine learning to measure real transformation in culture, capacity, and delivery Findata

Angelique Mohring explains what information flow means for financial services, covering the advantages and challenges and detailing how to measure real innovation and cultural transformation through the lens of data flow and nontraditional sources of data to accelerate both organizational capacity and delivery capabilities.

Fausto Morales is a data scientist at Arundo Analytics, where he works on product development and customer projects. Previously, Fausto worked at ExxonMobil on projects that included environmental remediation, product pricing, and water treatment process modeling. He holds a bachelor’s degree in civil engineering from MIT.

Presentations

Real-time motorcycle racing optimization DCS

In motorcycle racing, riders make snap decisions that determine outcomes spanning from success to grievous injury. Fausto Morales and Marty Cochrane explain how they use a custom software-based edge agent and machine learning to automate real-time maneuvering decisions in order to limit tire slip during races, thereby mitigating risk and enhancing competitive advantage.

Calum Murray is the chief data architect in the Small Business Group at Intuit. Calum has 20 years’ experience in software development, primarily in the finance and small business spaces. Over his career, he has worked with various languages, technologies, and topologies to deliver everything from real-time payments platforms to business intelligence platforms.

Presentations

Machine learning at Intuit: Five delightful use cases Session

Machine learning-based applications are becoming the new norm. Calum Murray shares five use cases at Intuit that use the data of over 60 million users to create delightful experiences for customers by solving repetitive tasks, freeing them up to spend time more productively or solving very complex tasks with simplicity and elegance.

Mikheil Nadareishvili is deputy head of BI at TBC Bank, where he is in charge of the company-wide data science initiative. His main responsibilities include overseeing the development of the bank’s data science capability and embedding it in the business to achieve maximum business value. Previously, Mikheil applied data science to various domains, most notably real estate (determining housing market trends and predicting real estate prices) and education (determining the factors that influence students’ educational attainment in Georgia).

Presentations

A data-driven journey to customer-centric banking Findata

Over the last three years, TBC Bank has transitioned from a product-centric approach to a client-centric one, which has included the development of advanced customer analytics. Mikheil Nadareishvili discusses this transition and explains how the company implemented an integrated 360-degree view of the customer and advanced analytics, enabling personalized service.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Setting up a lightweight distributed caching layer using Apache Arrow Session

Jacques Nadeau offers an overview of a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture, learn how data science, analytical, and custom applications can all leverage the cache simultaneously, and see a live demo.

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism for Apache Spark at Databricks. Paco is the cochair of the Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.

Presentations

Human in the loop: A design pattern for managing teams working with machine learning Session

Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.

Allison Nau is head of data solutions at Cox Automotive UK. Allison is a highly driven and self-motivated big data, analytics, and product executive with a proven track record in transforming businesses and driving strategic growth through data analysis and product development. Previously, Allison worked at LexisNexis, where she developed the entire product portfolio of data and analytics products for its expansion into the UK, leading to double-digit growth year on year for that new venture while transforming the motor insurance industry. A trained quantitative political scientist who got her start as a price optimization consultant, Allison holds a BA in mathematics and international relations from the College of Wooster and an MA in political science from the University of Michigan.

Presentations

Extracting value from data: How Cox Automotive is using data to drive growth and transform the way the world buys, sells, and owns cars DCS

Two and a half years into its data journey, Cox Automotive has realized significant benefits by harnessing the power of data, both internally and in the development of data solutions to improve decision making within the automotive industry. Allison Nau discusses Cox's transformation into a data-driven organization, from mobilizing data for production to making changes to its corporate culture.

Robert Neumann is founder and CEO of Ultra Tendency. He has more than a decade of experience designing and developing Hadoop-based applications and has been using Spark since v0.9.

Presentations

Charting a data journey to the cloud Keynote

What happens when you combine near-limitless data with on-demand access to powerful analytics and compute? For Deutsche Telekom, the results have been transformative. Mick Hollison, Sven Löffler, and Robert Neumann explain how Deutsche Telekom is harnessing machine learning and analytics in the cloud to build Europe’s largest and most powerful IoT data marketplace.

Elastic map matching using Cloudera Altus and Apache Spark Session

Map-matching applications exist in almost every telematics use case and are therefore crucial to all car manufacturers. Timo Graen and Robert Neumann detail the architecture behind Volkswagen Commercial Vehicles’ Altus-based map-matching application and lead a live demo featuring a map-matching job in Altus.

Kim Nilsson is the CEO of Pivigo, a London-based data science marketplace and training provider responsible for S2DS, Europe’s largest data science training program, which has by now trained more than 650 fellows working on over 200 commercial projects with 120+ partner companies, including Barclays, KPMG, Royal Mail, News UK, and Marks & Spencer. An ex-astronomer turned entrepreneur with a PhD in astrophysics and an MBA, Kim is passionate about people, data, and connecting the two.

Presentations

Successful data cultures: Inclusivity, empathy, retention, and results Session

Our lives are being transformed by data, changing our understanding of work, play, and health. Every organization can take advantage of this resource, but something is holding us back: us. Kim Nilsson and Phil Harvey explain how to build a successful data culture that embeds data at the heart of every organization through people and delivers success through empathy, communication, and humanity.

Michael Noll is a technologist in the office of the CTO at Confluent, the company founded by the creators of Apache Kafka. Previously, Michael was the technical lead of DNS operator Verisign’s big data platform, where he grew the Hadoop, Kafka, and Storm-based infrastructure from zero to petabyte-sized production clusters spanning multiple data centers—one of the largest big data infrastructures in Europe at the time. He’s a well-known tech blogger in the big data community. In his spare time, Michael serves as a technical reviewer for publishers such as Manning and is a frequent speaker at international conferences, including Strata, ApacheCon, and ACM SIGIR. Michael holds a PhD in computer science.

Presentations

Unlocking the world of stream processing with KSQL, the streaming SQL engine for Apache Kafka Session

Michael Noll offers an overview of KSQL, the open source streaming SQL engine for Apache Kafka, which makes it easy to get started with a wide range of real-time use cases, such as monitoring application behavior and infrastructure, detecting anomalies and fraudulent activities in data feeds, and real-time ETL.

Erik Nordström is a senior software engineer at Timescale, where he focuses on both core database and infrastructure services. Previously, he worked on Spotify’s backend service infrastructure and was a postdoc and research scientist at Princeton, where he focused on networking and distributed systems, including a new end-host network stack for service-centric networking. Erik holds an MSc and a PhD from Uppsala University in Sweden.

Presentations

A heretical monitoring view: Using PostgreSQL to store Prometheus metrics and visualizing them in Grafana Session

Erik Nordström explains how and why to use PostgreSQL as a Prometheus backend to support complex questions (and get a proper SQL interface), offers an overview of pg_prometheus, a custom Prometheus datatype, and prometheus-postgresql-adapter, a remote storage adaptor for PostgreSQL, and shares his experience with TimescaleDB, which enables PostgreSQL to scale for classic monitoring volumes.

Nick O’Donnell is LinkedIn’s director of public policy and government affairs for the Asia Pacific region, where he leads the company’s efforts to build productive partnerships with governments, decision makers, and policy influencers throughout the region. His role includes policy and political outreach, government-focused data-sharing projects, work on technology policy issues, and the development of workforce and education policy solutions that are at the core of LinkedIn’s corporate mission and its overarching vision of creating economic opportunity for every member of the global workforce. Previously, Nick was legal counsel for Seven West Media and head of public policy in Asia at Yahoo. Nick has served as the chair and treasurer of the Asia Internet Coalition, leading joint-industry advocacy across Asia and locally as a committee member of the Communications and Media Law Association. He holds a combined bachelor of media and bachelor of laws and a master’s degree in media, information technology, and communications law.

Presentations

Leveraging public-private partnerships using data analytics for economic insights Session

In October 2017, LinkedIn and the Australian Treasury teamed up to gain a deeper understanding of the Australian labor market through new data insights, which may inform economic policy and directly benefit society. Audrey Lobo-Pulo and Nick O'Donnell share some of the discoveries from this collaboration as well as the practicalities of working in a public-private partnership.

Brian O’Neill is the founder and consulting product designer at Designing for Analytics, where he focuses on helping companies design indispensable data products that customers love. Brian’s clients and past employers include Dell EMC, NetApp, TripAdvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JPMorgan Chase, the Future of Music Coalition, and E*TRADE, among others; over his career, he has worked on award-winning storage industry software for Akorri and Infinio. Brian has been designing useful, usable, and beautiful products for the web since 1996, and he has brought over 20 years of design experience to various podcasts, meetups, and conferences such as the O’Reilly Strata Conference in New York City and London, England. He is the author of the Designing for Analytics Self-Assessment Guide for Non-Designers as well as numerous articles on design strategy, user experience, and business related to analytics. Brian is also an expert advisor on the topics of design and user experience for the International Institute for Analytics. When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage performing as a professional percussionist and drummer. He leads the acclaimed dual-ensemble Mr. Ho’s Orchestrotica, which the Washington Post called “anything but straightforward,” and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival. If you’re at a conference, just look for the only guy with a stylish orange leather messenger bag.

Presentations

The business leader’s guide to designing indispensable analytics solutions and data products Session

Gartner says 85%+ of big data projects will fail. Your own company may have even spent millions on a recent project that isn’t really delivering the value or UX everyone hoped for. Brian O'Neill explains why CDOs, PMs, and business leaders who leverage design to prioritize utility, usability, and customer value will realize the best ROIs and demonstrates how to start evaluating your UX.

Ted Orme is vice president of marketing and business development at Attunity, where he is responsible for Attunity’s technology alliances in EMEA, including Cloudera, Hortonworks, HP, IBM, MapR, Microsoft, and Oracle. Ted also plays a key role in shaping thought leadership and presents regularly at leading industry events. Ted has been in the industry for over 16 years. He holds a BSc in economics from the University of Kent.

Presentations

Fortune 100 lessons: Architecting data lakes for real-time analytics and AI (sponsored by Attunity) Session

Modern analytics and AI initiatives require an adaptable data lake with a multistage architectural design to effectively ingest, stage, and provision specific datasets in real time. Ted Orme discusses his experience at Attunity creating a real-time data integration solution for Fortune 100 organizations and shares best practices and lessons learned along the way.

Francois Orsini is the chief technology officer for MZ’s Satori business unit. Previously, he served as vice president of platform engineering and chief architect, bringing his expertise in building server-side architecture and implementation for a next-gen social and server platform; was a database architect and evangelist at Sun Microsystems; and worked in OLTP database systems, middleware, and real-time infrastructure development at companies like Oracle, Sybase, and Cloudscape. Francois has extensive experience working with database and infrastructure development, honing his expertise in distributed data management systems, scalability, security, resource management, HA cluster solutions, and soft real-time and connectivity services. He also collaborated with Visa International and Visa USA to implement the first Visa Cash Virtual ATM for the internet and founded a VC-backed startup called Unikala in 1999. Francois holds a bachelor’s degree in civil engineering and computer sciences from the Paris Institute of Technology.

Presentations

Correlation analysis on live data streams Session

The rate of growth of data volume and velocity has been accelerating along with increases in the variety of data sources. This poses a significant challenge to extracting actionable insights in a timely fashion. Arun Kejariwal and Francois Orsini explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.

Carl Osipov is a program manager focused on helping Google’s customers and business partners get trained and certified to run machine learning and data analytics workloads on Google Cloud. Carl has more than 16 years of experience in the IT industry and has held leadership roles for programs and projects in the areas of big data, cloud computing, service-oriented architecture, machine learning, and computational natural language processing at some of the world’s leading technology companies across the United States and Europe. Carl has written over 20 articles in professional, trade, and academic journals and holds six patents from the USPTO. He has received three corporate awards from IBM for his innovative work. You can find out more about Carl on his blog.

Presentations

Serverless machine learning with TensorFlow Tutorial

Carl Osipov walks you through building a complete machine learning pipeline from ingest, exploration, training, and evaluation to deployment and prediction.

Maria Assunta Palmieri is a junior data scientist at Data Reply. She holds a master’s degree in mathematics engineering with a specialization in statistics from the Polytechnic University of Turin.

Presentations

Discovery through real-time monitoring: A case study from the automotive industry DCS

Do you want to know if the products you make and sell are efficient or if and when they will break down? Maria Assunta Palmieri proves that real-time monitoring has the answers as she shares a case study from the automotive industry that outlines best practices for managing and analyzing telematic data in order to discover all the achievable benefits.

Paul Parau is a researcher and technical lead of the Recognos Smart Data Platform. He specializes in image processing, document layout analysis, and data extraction algorithms. Previously, Paul conducted research in the field of network science, with an emphasis on applications in social networks. He is currently focusing his research on brain network analysis.

Presentations

Spark NLP in action: Intelligent, high-accuracy fact extraction from long financial documents Session

Spark NLP natively extends Spark ML to provide natural language understanding capabilities with performance and scale that was not possible to date. David Talby, Saif Addin Ellafi, and Paul Parau explain how Spark NLP was used to augment the Recognos smart data extraction platform in order to automatically infer fuzzy, implied, and complex facts from long financial documents.

Robert Passarella evaluates AI and machine learning investment managers for Alpha Features. Rob has spent over 20 years on Wall Street in the gray zone between business and technology, focusing on using technology and innovative information sources to empower novel ideas in research and the investment process. A veteran of Morgan Stanley, JPMorgan, Bear Stearns, Dow Jones, and Bloomberg, he has seen the transformational challenges firsthand, up close and personal. Always intrigued by the consumption and use of information for investment analysis, Rob is passionate about leveraging alternative and unstructured data for use with machine learning techniques. Rob holds an MBA from the Columbia Business School.

Presentations

Findata welcome Tutorial

Hosts Alistair Croll and Robert Passarella welcome you to Findata Day.

No stone unturned: Financial research as an intelligence organization Findata

Modern financial research organizations are becoming intelligence organizations, similar in many respects to the CIA and MI6. Finding needles in a haystack means they need to build haystacks encompassing signal processing, unstructured data, and satellite imagery. Drawing on real-world examples, Robert Passarella explains what data offers finance and why it might actually work.

Joshua Patterson is a director of AI infrastructure at NVIDIA leading engineering for RAPIDS.AI. Previously, Josh was a White House Presidential Innovation Fellow and worked with leading experts across public sector, private sector, and academia to build a next-generation cyberdefense platform. His current passions are graph analytics, machine learning, and large-scale system design. Josh loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina Moore School of Business.

Presentations

GPU-accelerated threat detection with GOAI Session

Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration.

Miha Pelko is a data engineer at BMW, where he focuses on big data for the company’s autonomous driving division. Previously, he was a consulting data scientist and data engineer at several German car manufacturers, where he introduced Hadoop and Spark in their data processing workflows, and was a data scientist in the fields of sport prediction and insurance. Miha holds a PhD in computational neuroscience from the University of Edinburgh.

Presentations

Enabling data-driven development for autonomous driving at BMW (sponsored by BMW) Session

Developing autonomous cars requires handling the huge amounts of data produced by test vehicles and solving a number of critical challenges specific to the automotive industry. Miha Pelko and Aleksandr Melkonyan outline these challenges and explain how BMW is overcoming them by adapting and reinventing existing big data solutions for autonomous driving.

Nick Pentreath is a principal engineer at the Center for Open Source Data & AI Technologies (CODAIT) at IBM, where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations, and was at Goldman Sachs, Cognitive Match, and Mxit. He’s a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Deep learning for recommender systems Session

In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. Nick Pentreath explores recent advances in this area in both research and practice.

Model parallelism in Spark ML cross-validation Session

Tuning a Spark ML model using cross-validation involves a computationally expensive search over a large parameter space. Nick Pentreath and Bryan Cutler explain how enabling Spark to evaluate models in parallel can significantly reduce the time to complete this process for large workloads and share best practices for choosing the right configuration to achieve optimal resource usage.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, Thomas was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit file system, and an early employee at VMware, where, as a senior staff engineer and a key member of the ESX storage architecture team, he designed and developed the ESX storage I/O load-balancing subsystem and modular pluggable storage architecture and led teams working on many key storage initiatives, such as the cloud storage gateway and vFlash.

Presentations

Deep learning with TensorFlow and Spark using GPUs and Docker containers Session

In the past, you needed a high-end proprietary stack for advanced machine learning, but today, you can use open source machine learning and deep learning algorithms available with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan demonstrate how to deploy a TensorFlow and Spark with NVIDIA CUDA stack on Docker containers in a multitenant environment.

How to protect big data in a containerized environment Session

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE), but TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them.

Aurélie Pols is the chief visionary officer of Mind Your Group by Mind Your Privacy and teaches privacy and ethics at IE Business School in Madrid and Solvay Business School in Brussels. Aurélie designs data privacy best practices, documenting data flows in order to limit privacy backlashes and minimizing risk related to ever-increasing data uses while solving for data quality—the most accurate label would probably be “privacy engineer.” She used to follow the money to optimize data trails; now she follows the data to minimize increasing compliance and privacy risks and implement security best practices and ethical data use. Her mantra is: Data is the new oil; Privacy is the new green; Trust is the new currency. She has spent the past 15 years optimizing (digital) data-based decision-making processes. She also cofounded and successfully sold a startup to Digitas LBi (Publicis) and served as data governance and privacy advocate for leading data management platform (DMP) Krux Digital Inc. prior to its acquisition by Salesforce. Aurélie has spoken at various events all over the globe, including SXSW, Strata Data Conference, the IAPP’s Data Protection Congress, Webit, and eMetrics summits, and has written several whitepapers on data privacy and privacy engineering best practices. She is a member of the European Data Protection Supervisor’s (EDPS) Ethics Advisory Group (EAG), cochairs the IEEE’s P7002—Data Privacy Process standard initiative, and serves as a training advisor to the International Association of Privacy Professionals (IAPP).

Presentations

General Data Protection Regulation (GDPR) tutorial and ePrivacy introduction Tutorial

Aurélie Pols walks you through a "5+5 pillars" framework for GDPR readiness, explaining what the GDPR means to data-fueled businesses. You'll learn how to attribute responsibility to assure compliance and build toward ethical data practices, minimizing risk for your company while fostering trust with your clients.

Stuart Pook is a senior DevOps engineer at Criteo, where he is part of Criteo’s Lake team, which runs some small and two rather large Hadoop clusters. Stuart loves storage (208 PB at Criteo) and automation with Chef, because configuring more than 3,000 Hadoop nodes by hand is just too slow. Before discovering Hadoop, he developed user interfaces and databases for biotech companies. Stuart has presented at ACM CHI 2000, Devoxx 2016, NABD 2016, Hadoop Summit Tokyo 2016, Apache Big Data Europe 2016, Big Data Tech Warsaw 2017, and Apache Big Data North America 2017.

Presentations

The cloud is expensive, so build your own redundant Hadoop clusters. Session

Criteo runs production clusters of 2,000 nodes executing over 300,000 jobs a day in the company’s own data centers. These clusters were built to provide a redundant solution to Criteo’s storage and compute needs. Stuart Pook offers an overview of the project, shares challenges and lessons learned, and discusses Criteo’s progress in building another cluster to survive the loss of a full data center.

Jean-François Puget is the technical lead for IBM machine learning and optimization offerings. Jean-François has spent his entire career turning scientific ideas into innovative software. He joined IBM as part of the ILOG acquisition and since then has held various technical executive positions, including CTO for IBM Analytics Solutions. Jean-François has published over 80 scientific papers in refereed journals and top AI conferences. He holds a PhD in machine learning from Paris IX University.

Presentations

Humans and the machine: Machine learning in context (sponsored by IBM) Keynote

On the way to active analytics for business, we have to answer two big questions: What must happen to data before running machine learning algorithms, and how should machine learning output be used to generate actual business value? Jean-François Puget demonstrates the vital role of human context in answering those questions.

Maryam Qurashi is a data scientist at Pivigo, where she works with data scientists and organizations who are seeking to become more data driven. This means dealing with both the realities and possibilities of how to design and scope a data science project. She’s very interested in and curious about questions of ethics and moral philosophy as applied to data science and technology. Maryam became involved in the data science community in London as an S2DS fellow, following a career in academia as a microscopy image analyst.

Presentations

Predicting rent arrears: Leveraging data science in the public sector Session

One major challenge in social housing is determining how best to target interventions when tenants fall behind on rent payments. Jonathan Leslie, Maryam Qurashi, and Tom Harrison discuss a recent project in which a team of trainee data scientists helped Hackney Council devise a more efficient, targeted strategy to detect and prioritize such situations.

Phillip Radley is chief data architect on the core enterprise architecture team at BT, where he’s responsible for data architecture across the company. Based at BT’s Adastral Park campus in the UK, Phill leads BT’s MDM and big data initiatives, driving the associated strategic architecture and investment road maps for the business. He’s worked in IT and communications for 30 years. Previously, Phill was chief architect for infrastructure performance-management solutions, from UK consumer broadband to outsourced Fortune 500 networks and high-performance trading networks. He has broad global experience, including work with BT’s Concert global venture in the US and five years as an Asia-Pacific BSS/OSS architect based in Sydney. Phill is a physics graduate with an MBA.

Presentations

How BT delivers better broadband and TV using Spark and Kafka Session

In the past year, British Telecom has added a streaming network analytics use case to its multitenant data platform. Phillip Radley demonstrates how the solution works and explains how it delivers better broadband and TV services, using Kafka and Spark on YARN and HDFS encryption.

Syed Rafice is a principal system engineer at Cloudera specializing in big data on Hadoop technologies as well as platform and cybersecurity. He is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed has worked across multiple sectors including government, telecoms, media, utilities, financial services, and transport.

Presentations

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Syed Rafice outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Greg Rahn is director of product management at Cloudera, where he’s responsible for driving SQL product strategy as part of the company’s data warehouse product team, including working directly with Impala. For over 20 years, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently product management, providing a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Analytics in the cloud: Building a modern cloud-based big data warehouse Session

For many organizations, the next big data warehouse will be in the cloud. Greg Rahn shares considerations for evaluating the cloud for analytics and big data warehousing, including different architectural approaches to optimize price and performance.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Modern real-time streaming architectures Tutorial

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. Karthik Ramasamy, Arun Kejariwal, and Ivan Kelly walk you through state-of-the-art streaming frameworks, algorithms, and architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Adesh Rao is a member of the technical staff on the Hive team at Qubole. He holds a degree from BITS Pilani.

Presentations

Autonomous ETL with materialized views Session

Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.

Erin Recachinas is a software engineer and Scrum Master for the infrastructure and performance team at Zoomdata, where she and her team recently rearchitected Zoomdata’s streaming capabilities. Previously, Erin was a full stack engineer at Appian. She studied computer science and mathematics at the University of Virginia.

Presentations

You’re doing it wrong: How Zoomdata rearchitected streaming Session

The value of real-time streaming analytics with historical data is immense. Big data application Zoomdata updates historical dashboards in real time without complex reaggregations, but streaming in the age of the IoT requires handling of data in volumes not seen in traditional feeds. Erin Recachinas explains how Zoomdata moved to a scalable microservice architecture for streaming sources.

Alberto Rey is head of data science at easyJet, where he leads easyJet’s efforts to adopt advanced analytics within different areas of the business. Alberto’s background is in air transport and economics, and he has more than 15 years’ experience in the air travel industry. Alberto started his career in advanced analytics as a member of the pricing and revenue management team at easyJet, working on the development of one of the most advanced pricing engines within the industry, where his team pioneered the implementation of machine learning techniques to drive pricing. He holds an MSc in data mining and an MBA from Cranfield University.

Presentations

Data science survival and growth within the corporate jungle: An easyJet case study Session

Because in-house data science teams work with a range of business functions, traditional data science processes are often too abstract to cope with the complexity of these environments. Alberto Rey Villaverde and Grigorios Mingas share case studies from easyJet that highlight some unpredictable hurdles related to requirements, data, infrastructure, and deployment and explain how they solved them.

Pierre Romera is the chief technology officer at the International Consortium of Investigative Journalists (ICIJ), where he manages a team of programmers working on the platforms that enabled more than 300 journalists to collaborate on the Paradise Papers and Panama Papers investigations. Previously, he cofounded Journalism++, the Franco-German data journalism agency behind the Migrant Files, a project that won the European Press Prize in 2015 for Innovation. He is one of the pioneers of data journalism in France.

Presentations

The Paradise Papers: Behind the scenes with the ICIJ Keynote

Last November, the International Consortium of Investigative Journalists (ICIJ) published the Paradise Papers, a yearlong investigation into the offshore dealings of multinational companies and the wealthy. Pierre Romera offers a behind-the-scenes look into the process and explores the challenges in handling 1.4 TB of data and making it available securely to journalists all over the world.

Mael Ropars is a senior sales engineer at Cloudera, helping customers solve their big data problems using enterprise data hubs based on Hadoop. Mael has 15 years’ experience working around big data, information management, and middleware in technical sales and service delivery.

Presentations

Running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Jason Wang, Eugene Fratkin, and Mael Ropars detail new paradigms to effectively run production-level pipelines with minimal operational overhead. Join in to learn how to remove barriers to data discovery, metadata sharing, and access control.

Nikki Rouda is the cloud and core platform director at Cloudera. Nik has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their IT challenges. His career spans big data, analytics, machine learning, AI, storage, networking, security, and the IoT. Nik holds an MBA from Cambridge and an ScB in geophysics and math from Brown.

Presentations

Security, governance, and cloud analytics, oh my! Session

Having so many cloud-based analytics services available is a dream come true. However, it's a nightmare to manage proper security and governance across all those different services. Nikki Rouda and Nick Curcuru share advice on how to minimize the risk and effort in protecting and managing data for multidisciplinary analytics and explain how to avoid the hassle and extra cost of siloed approaches.

Christopher Royles is a systems engineer at Cloudera, where he builds out large-scale data lakes on Amazon and Azure and assists customers from their initial MVP to full-scale production. Chris has advised on UK government open data initiatives as part of the Open Data User Group (ODUG) and sat on the quick wins stream of the UK Government Cloud Program (GCloud). He holds a PhD in artificial intelligence from Liverpool University, which he subsequently applied to voice recognition and voice dialogue systems.

Presentations

Practical advice for driving down the cost of cloud big data platforms Session

Big data and cloud deployments return huge benefits in flexibility and economics but can also result in runaway costs and failed projects. Drawing on his production experience, Christopher Royles shares tips and best practices for determining initial sizing, strategic planning, and longer-term operation, helping you deliver an efficient platform, reduce costs, and implement a successful project.

Omer Sagi is a senior data scientist on the data science team at Dell, where he leads several data science projects in the fields of precision agriculture, online marketing, failure prediction, and text classification. Omer has also taught courses on Java programming and databases. He holds a master’s degree from the Department of Industrial Engineering at Ben-Gurion University; his thesis presented a novel approach for assessing the monetary damages of data loss incidents. Omer is currently a PhD candidate in the Department of Software and Information Systems Engineering at Ben-Gurion University, focusing on developing algorithms that simplify ensemble models.

Presentations

Improving DevOps and QA efficiency using machine learning and NLP methods Session

DevOps and QA engineers spend a significant amount of time investigating recurring issues. These issues are often represented by large configuration and log files, so the process of investigating whether two issues are duplicates can be a very tedious task. Ran Taig and Omer Sagi outline a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues.

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. Previously, he worked at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka.

Presentations

Improving ad hoc and production workflows at Stitch Fix Session

Neelesh Srinivas Salian offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.

Chen Salomon is the architect at high-scale storytelling platform Playbuzz, where, as the first employee, he has been responsible for the design and implementation of a scale-ready system since day one and implemented Playbuzz’s data pipeline, which collects, enriches, and stores thousands of events per second. An experienced developer, Chen specializes in high-scale web environments, specifically caching, CDN, cloud architectures, and microservices architectures.
Chen’s academic background includes research in the fields of social networks and content distribution with a focus on online experiments.

Presentations

Are we doing this wrong? Advertisement features A/B testing Session

A/B testing is the foundation of data-driven decision making. In today's world, advertising is crucial to a website's revenue, so it is even more important to measure the effects of changes correctly. Chen Salomon demonstrates how to correctly design and implement an advertisement A/B test and shares pitfalls, potential biases related to advertisement metrics, and possible mitigations.

Guillaume Salou is the machine learning services team leader at OVH, where he focuses on extracting high value from specific data science applications and making it available to all. Previously, he worked on data lakes.

Presentations

Continuous delivery and machine learning Session

Guillaume Salou shares OVH's approach to continuous deployment of machine learning models, which involved building a full stack of automated machine learning. Automated machine learning allows the company to rebuild models efficiently and keep models up to date with fresh data brought by its data convergence tool.

Mathew Salvaris is a data scientist at Microsoft. Previously, Mathew was a data scientist for a small startup that provided analytics for fund managers; a postdoctoral researcher at UCL’s Institute of Cognitive Neuroscience, where he worked with Patrick Haggard in the area of volition and free will, devising models to decode human decisions in real time from the motor cortex using electroencephalography (EEG); and a postdoc in the University of Essex’s Brain Computer Interface Group, where he worked on BCIs for computer mouse control. Mathew holds a PhD in brain-computer interfaces and an MSc in distributed artificial intelligence.

Presentations

Distributed training of deep learning models Session

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

Jim Scott is the head of developer relations, data science, at NVIDIA. He’s passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals, and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

Using a global data fabric to run a mixed cloud deployment Session

Creating a business solution is a lot of work. Instead of building to run on a single cloud provider, it is far more cost-effective to leverage the cloud as infrastructure as a service (IaaS). Jim Scott explains why a global data fabric is a requirement for running on all cloud providers simultaneously.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Julie Shin is the head of strategic operations and innovation at Citigroup, where she executes enterprise transformation initiatives with management teams that drive the organization’s strategy across products, geographies, operations, technology, and global functions. Julie also manages a portfolio of fintech ventures to test for proof of concept and commercialization viability in improving client experiences, scaling internal capability building, delivering efficiency and growth opportunities, and enabling open innovation with partnership from businesses and senior leadership teams. Previously, Julie was an investment banker at Deutsche Bank covering retail, consumer, and financial technology verticals, and in corporate development and executive relationship management roles at Lincoln Financial Group. Julie holds a BA from Northwestern University and an MBA from the University of Chicago’s Booth School of Business and completed postbaccalaureate studies at the Wharton School of the University of Pennsylvania. She holds FINRA 6, 7, and 63 licenses and is an active member of the New York community. She’s also an angel investor. Julie is an honoree of Innovate Finance’s Women in FinTech Power List and Remodista’s Women2Watch in Business Disruption.

Presentations

Data, AI, and innovation in the enterprise Session

What are the latest initiatives and use cases around data and AI? How are data and AI reshaping industries? How do we foster a culture of data and innovation within a larger enterprise? What are some of the challenges of implementing AI within the enterprise setting? Michael Li moderates a panel of experts in different industries to answer these questions and more.

Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers; and he held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.

Presentations

Data science across data sources with Apache Arrow Session

It's often impractical for organizations to physically consolidate all data into one system. Tomer Shiran offers an overview of Apache Arrow, an open source columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real time, simplifying and accelerating data access without having to copy all data into one location.

Zubin Siganporia is the founder and CEO of QED Analytics, a consultancy specializing in mathematical modeling and machine learning. His work has involved a wide variety of industries, including cryptography, financial algorithms, genetics, and the design of strategy and training systems for world-leading sports teams and Olympic squads. Zubin is a fellow in industrial and applied mathematics at Oxford University and a lecturer and tutor in pure mathematics. He was recently voted the most outstanding tutor in Oxford across the Mathematical, Physical, Life, and Engineering Sciences departments.

Presentations

When to KISS Keynote

The KISS principle tells us to "Keep it simple, stupid." As machine learning techniques become more sophisticated, the need to KISS only becomes greater. Zubin Siganporia discusses the role that simplicity plays in approaching a problem and then convincing end users to adopt data-driven solutions to their challenges.

Kevin Sigliano is a professor of digital transformation and entrepreneurship at IE Business School. Kevin is also managing partner of digital transformation consulting firm Good Rebels. He has over 15 years of corporate experience in management consulting firms including PwC and IBM BCS. As an entrepreneur, Kevin has launched numerous startups; over the last 10 years, he has been involved in the international development of smart cities and the digital signage company Admira. He has always reserved a part of his time for pro bono and educational work.

Presentations

Executive Briefing: The ROI of data-driven digital transformation Session

Financial and consumer ROI demands that business leaders understand the drivers and dynamics of digital transformation and big data. Kevin Sigliano explains why disrupting value propositions and continuous innovation are critical if you wish to dramatically improve the way your company engages customers, creates value, and maximizes financial results.

Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Marton Balassi, Steven Totman, and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Barry Singleton is the vice president of client engagement at innovative AI company IMC Business Architecture, which is pioneering behavioral AI. He helps some of the world’s largest companies design and implement AI that changes the way they understand their big data and is responsible for establishing commercial and academic partnerships in the UK. Most notably, Barry helped companies including large financial institutions, the NHS, the Leeds Institute of Medical Education, the University of Leeds, and a number of British charities better understand the latent data all organizations possess. Previously, he worked with High Street retailers and globally recognized brands to transition from analog to digital technologies.

Presentations

Blind men and elephants: What’s missing from your big data? Session

Big data analytics tends to focus on what is easily available, which is by and large data about what has already happened, the implicit assumption being that past behavior will predict future behavior. Organizations already possess data they aren’t exploiting. Barry Singleton and Richard Goyder explain how, with the right tools, it can be used to develop far more powerful predictive algorithms.

Konrad Sippel leads Deutsche Börse’s Content Lab, which develops unique and scalable IP to drive growth to the company’s services and products. Previously, Konrad led business development at STOXX Ltd.

Presentations

How Deutsche Börse designed a world-class analytics lab Findata

Deutsche Börse has built out a data science team, the Content Lab, dedicated to advanced analytics efforts focused on driving new products in risk management and investment decision making. Konrad Sippel discusses the people, processes, and technology that make up this initiative and the early successes driven by the Content Lab.

Abhishek Somani is a senior staff engineer on the Hive team at Qubole. Previously, Abhishek worked at Citrix and Cisco. He holds a degree from NIT Allahabad.

Presentations

Autonomous ETL with materialized views Session

Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.

Ramesh Sridharan is a machine learning engineering manager at Captricity. Ramesh is passionate about using technology for social good, and his research has helped enable a cross-collaboration between researchers and doctors to understand large, complex medical image collections, particularly in predicting the effects of diseases such as Alzheimer’s on brain anatomy. He holds a PhD in electrical engineering and computer science from MIT’s Computer Science and Artificial Intelligence Lab (CSAIL), where his thesis focused on developing machine learning and computer vision technologies to enhance medical image analysis.

Presentations

How Captricity manages 10,000 tiny deep learning models in production Session

Most uses of deep learning involve models trained with large datasets. Ramesh Sridharan explains how Captricity uses deep learning with tiny datasets at scale, training thousands of models using tens to hundreds of examples each. These models are dynamically trained using an automatic deployment framework, and carefully chosen metrics further exploit error properties of the resulting models.

Stamatis Stefanakos is a managing director with D ONE, a premium business consultancy headquartered in Zurich, where he advises his clients in data analytics focusing on architecture and strategy. He holds a PhD in theoretical computer science from the Swiss Federal Institute of Technology Zurich (ETH) and a diploma from the University of Patras.

Presentations

Big data meets renewable energy: Building a real-time asset management platform for renewable energy Session

Switzerland-based startup WinJi capitalizes on two current megatrends: big data and renewable energy. Stamatis Stefanakos offers an overview of WinJi's TruePower Asset Management Platform, covering the overall architecture and the motivation behind it, the physics behind the data, and the business case.

Bargava Subramanian is a cofounder and deep learning engineer at Binaize in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies. He mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland, College Park. He’s an ardent NBA fan.

Presentations

Architectural design for interactive visualization Session

Creating visualizations for data science requires an interactive setup that works at scale. Bargava Subramanian and Amit Kapoor explore the key architectural design considerations for such a system and discuss the four key trade-offs in this design space: rendering for data scale, computation for interaction speed, adapting to data complexity, and being responsive to data velocity.

Deep learning in the browser: Explorable explanations, model inference, and rapid prototyping Session

Amit Kapoor and Bargava Subramanian lead three live demos of deep learning (DL) done in the browser—building explorable explanations to aid insight, building model inference applications, and rapid prototyping and training an ML model—using the emerging client-side JavaScript libraries for DL.

Ran Taig is a senior data scientist at Dell, responsible for both the business and the scientific aspects of the data science lifecycle. A machine learning practitioner with a strong academic background, Ran is also an experienced lecturer who has delivered core CS courses to undergrads at Ben-Gurion University. He is fluent in the common data science toolbox (Python, pandas, SQL, Spark, etc.). Ran holds a PhD in artificial intelligence.

Presentations

Improving DevOps and QA efficiency using machine learning and NLP methods Session

DevOps and QA engineers spend a significant amount of time investigating recurring issues. These issues are often represented by large configuration and log files, so the process of investigating whether two issues are duplicates can be a very tedious task. Ran Taig and Omer Sagi outline a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues.

David Talby is the chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe with Microsoft’s Bing Group, and at Amazon he built and ran distributed teams in both Seattle and the UK that helped scale the company’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby and Claudiu Branzan lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Spark NLP in action: Intelligent, high-accuracy fact extraction from long financial documents Session

Spark NLP natively extends Spark ML to provide natural language understanding capabilities with performance and scale that was not possible to date. David Talby, Saif Addin Ellafi, and Paul Parau explain how Spark NLP was used to augment the Recognos smart data extraction platform in order to automatically infer fuzzy, implied, and complex facts from long financial documents.

Elena Terenzi is a software development engineer at Microsoft, where she brings business intelligence solutions to Microsoft enterprise customers and advocates for business analytics and big data solutions for the manufacturing sector in Western Europe, such as helping big automotive customers implement telemetry analytics solutions with an IoT flavor in their enterprises. She started her career with data as a database administrator and data analyst for an investment bank in Italy. Elena holds a master’s degree in AI and NLP from the University of Illinois at Chicago.

Presentations

Detecting small-scale mines in Ghana Session

Michael Lanzetta and Elena Terenzi offer an overview of a collaboration between Microsoft and the Royal Holloway University that applied deep learning to locate illegal small-scale mines in Ghana using satellite imagery, scaled training using Kubernetes, and investigated the mines' impact on surrounding populations and environment.

Ankit Tharwani is vice president and head of big data at Barclays UK. His role spans a number of areas, including building the connected data platform ecosystem, driving engineering and adoption of new tooling, and leading security design and the cloud journey of the data assets. Ankit specializes in big data technologies and has led the delivery of several firsts in this space, including the Elastic Data Platform, real-time warehouses, and the first commercial big data use case for Barclays.

Presentations

Moving machine learning and analytics to hyperspeed Keynote

Imagine the value you could drive in your business if you could accelerate your journey to machine learning and analytics. Amr Awadallah, Ankit Tharwani, and Bala Chandrasekaran explain how Barclays has driven innovation in real-time analytics and machine learning with Apache Kudu, accelerating the time to value across multiple business initiatives, including marketing, fraud prevention, and more.

Niranjan Thomas is general manager of platform and technology partnerships within Dow Jones’s professional information business, where he is responsible for the DNA data platform, including product strategy, go-to-market, customer solutions, and driving technology ecosystem partnerships. Niranjan has over 16 years of experience in technology leadership roles across software design and development, software project management, cybersecurity, and risk and business management. He has spent significant time delivering solutions for the manufacturing, consulting, technology, telecommunications, media, and financial services industries. Previously, he was head of technology at AMP, Australia’s leading specialist wealth management and life insurance company. Niranjan holds a bachelor’s degree in business information systems from RMIT University in Melbourne, Australia.

Presentations

Unlocking the hidden potential of bad news: Using news-derived data to uncover and solve complex societal problems DCS

What do hurricanes, the Zika virus, and modern slavery have in common? Obviously, they’re serious global issues that cause human suffering, death, and destruction, but they’re also challenges the Strata community can seek to better understand in order to minimize their negative consequences—all with news-derived data. Niranjan Thomas issues a call to think creatively about new datasets.

Deepak Tiwari is the head of product management for data at Lyft, where he’s responsible for the company’s data vision as well as for building its data infrastructure, data platform, and data products. This includes Lyft’s streaming infrastructure for real-time decision making, geodata store and visualization, platform for machine learning, and core infrastructure for big data analytics. Previously, he was a product management leader at Google, where he worked on search, cloud, and technical infrastructure products. Deepak is passionate about building products that are driven by data, focus on user experience, and work at web scale. He holds an MBA from Northwestern’s Kellogg School of Management and a BT in engineering from the Indian Institute of Technology, Kharagpur.

Presentations

Democratizing data within your organization Session

Sure, you’ve got the best and fastest running SQL engine, but you’ve still got some problems: Users don’t know which tables exist or what they contain; sometimes bad things happen to your data, and you need to regenerate partitions but there is no tool to do so. Mark Grover and Deepak Tiwari explain how to make your team and your larger organization more productive when it comes to consuming data.

Steven Totman is the financial services industry lead for Cloudera’s Field Technology Office, where he helps companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Prior to Cloudera, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining with the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents for data-integration and governance/metadata-related designs.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Marton Balassi, Steven Totman, and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Steve Touw is the cofounder and CTO of Immuta. Steve has a long history of designing large-scale geotemporal analytics across the US intelligence community, including some of the very first Hadoop analytics, as well as frameworks to manage complex multitenant data policy controls. He and his cofounders at Immuta drew on this real-world experience to build a software product to make data security and privacy controls easier. Previously, Steve was the CTO of 42six (acquired by Computer Sciences Corporation), where he led a large big data services engineering team. Steve holds a BS in geography from the University of Maryland.

Presentations

How will the GDPR impact machine learning? Session

The Strata Data conference in London takes place during one of the most important weeks in the history of data regulation, as GDPR begins to be enforced. Steve Touw explores the effects of the GDPR on deploying machine learning models in the EU.

Michael Troughton is the general manager for ex-Americas at Conduce. Based in Amsterdam, the Netherlands, he has a background in law, technology, and renewable energy finance.

Presentations

How DHL is increasing efficiency and reducing distance traveled across the warehouse with the IoT DCS

DHL has partnered with Conduce on a human interface providing real-time visualizations that track and analyze distance traveled by personnel and warehouse equipment, all calibrated around a center of activity. Michael Troughton explains how this immersive data visualization gives DHL unprecedented insight to evaluate and act on everything that occurs in its warehouses.

Teresa Tung is a managing director at Accenture, where she’s responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s Applied Intelligence Platform. Teresa is Accenture’s most prolific inventor, with more than 170 patents and patent applications. She holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Becoming a data-driven enterprise—A maturity model Session

A data-driven enterprise maximizes the value of its data. But how do enterprises emerging from technology and organization silos get there? Teresa Tung and Jean-Luc Chatelain explain how to create a data-driven enterprise maturity model that spans technology and business requirements and walk you through use cases that bring the model to life.

Enric Biosca has 14 years of professional experience in IT consulting, specializing in artificial intelligence, big data, cloud, business intelligence, technical architectures, and IT strategy. He has deep knowledge of the life cycle of complex transformation projects, from conception to implementation.

Presentations

The eAGLE accelerator: How to speed up migrations from legacy ETL to big data implementations Session

Enric Biosca offers an overview of the eAGLE accelerator, which speeds up migration processes from legacy ETL to big data implementations by enabling auditing, lineage, and translation of legacy code for big data. Along the way, Enric demonstrates how graph and automatic translation technologies help companies reduce their migration times.

Kate Vang is a data scientist at the ONE Campaign and a chapter lead at DataKind UK, where she consults with charities, NGOs, and corporations to find stories and insights in data. Previously, Kate worked in investment management at Värde Partners, where she held roles across risk management, trading, investment, data strategy, and portfolio strategy.

Presentations

Executive Briefing: Killer robots and how not to do data science Session

Not a day goes by without reading headlines about the fear of AI or how technology seems to be dividing us more than bringing us together. DataKind UK is passionate about using machine learning and artificial intelligence for social good. Kate Vang and Christine Henry explain what socially conscious AI looks like and what DataKind is doing to make it a reality.

Vinithra Varadharajan is a senior engineering manager in the cloud organization at Cloudera, where she’s responsible for the cloud portfolio products, including Altus Data Engineering, Altus Analytic Database, Altus SDX, and Cloudera Director. Previously, Vinithra was a software engineer at Cloudera working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

Running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Jason Wang, Eugene Fratkin, and Mael Ropars detail new paradigms to effectively run production-level pipelines with minimal operational overhead. Join in to learn how to remove barriers to data discovery, metadata sharing, and access control.

Emre Velipasaoglu is principal data scientist at Lightbend. A machine learning expert, Emre previously served as principal scientist and senior manager at Yahoo! Labs. He has authored 23 peer-reviewed publications and nine patents in search, machine learning, and data mining. Emre holds a PhD in electrical and computer engineering from Purdue University and completed postdoctoral training at Baylor College of Medicine.

Presentations

Machine-learned model quality monitoring in fast data and streaming applications Session

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu reviews monitoring methods, focusing on their applicability in fast data and streaming applications.
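The retraining problem the abstract describes can be made concrete with a toy sliding-window monitor: compare the recent error rate against a baseline established when the model was deployed, and flag drift when the gap exceeds a threshold. This is a minimal illustrative sketch, not a method from the session; the class name, window size, and threshold are all assumptions.

```python
class DriftMonitor:
    """Toy model-quality monitor using a sliding window of prediction errors.

    Illustrative only: real streaming monitors (e.g., DDM, Page-Hinkley)
    use statistically grounded tests rather than a fixed threshold.
    """

    def __init__(self, window=100, threshold=0.15):
        self.window = window          # number of recent predictions to compare
        self.threshold = threshold    # allowed error-rate increase over baseline
        self.baseline = None          # error rate observed at deployment time
        self.errors = []              # 1 = wrong prediction, 0 = correct

    def observe(self, correct):
        """Record one labeled prediction; return True if drift is detected."""
        self.errors.append(0 if correct else 1)
        if len(self.errors) < self.window:
            return False              # not enough data yet
        rate = sum(self.errors[-self.window:]) / self.window
        if self.baseline is None:
            self.baseline = rate      # first full window sets the baseline
            return False
        return rate - self.baseline > self.threshold
```

Feeding the monitor a stream whose accuracy degrades (say, from near-perfect to one correct prediction in three) trips the drift flag once the recent window's error rate pulls far enough above the baseline.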

Nanda Vijaydev is the lead data scientist and head of solutions at BlueData (now HPE), where she leverages technologies like TensorFlow, H2O, and Spark to build solutions for enterprise machine learning and deep learning use cases. Nanda has more than 10 years of experience in data science and data management. Previously, she worked on data science projects in multiple industries as a principal solutions architect at Silicon Valley Data Science and served as director of solutions engineering at Karmasphere.

Presentations

Deep learning with TensorFlow and Spark using GPUs and Docker containers Session

In the past, you needed a high-end proprietary stack for advanced machine learning, but today, you can use open source machine learning and deep learning algorithms available with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan demonstrate how to deploy a TensorFlow and Spark stack with NVIDIA CUDA on Docker containers in a multitenant environment.

Jivan Virdee is a data designer on the data and design team at Fjord, where she focuses on building adaptable, human-centric data-driven experiences. Jivan has a background in data science and has worked mainly in the healthcare and retail sectors, using data to drive design and business decision making. Previously, she was a part of the advanced analytics and AI team at Accenture’s multidisciplinary research and incubation hub in Dublin.

Presentations

Designing ethical artificial intelligence Session

Artificial intelligence systems are powerful agents of change in our society, but as this technology becomes increasingly prevalent—transforming our understanding of ourselves and our society—issues around ethics and regulation will arise. Jivan Virdee and Hollie Lubbock explore how to address fairness, accountability, and the long-term effects on our society when designing with data.

Humanizing data: How to find the why DCS

Data has opened up huge possibilities for analyzing and customizing services. However, although we can now manage experiences to dynamically target audiences and respond immediately, context is often missing. Hollie Lubbock and Jivan Virdee share a practical approach to discovering the reasons behind the data patterns you see and help you decide what level of personalized service to create.

Naghman Waheed is the data platforms lead at Bayer Crop Science, where he’s responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order to cash, finance, and procurement. Throughout his 20+ year career at Bayer, Naghman has held a variety of positions in the data space, ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master’s degree in information management, both from Washington University.

Presentations

You call it a data lake; we call it Data Historian. Session

There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He’s head of developer relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he’s the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He earned his PhD in physics from the University of Washington.

Presentations

Ask Me Anything: Streaming applications and architectures Session

Join Dean Wampler and Boris Lublinsky to discuss all things streaming: architecture, implementation, streaming engines and frameworks, techniques for serving machine learning models in production, traditional big data systems (dying or still relevant?), and general software architecture and data systems.

Executive Briefing: What you need to know about fast data Session

Streaming data systems, so-called fast data, promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler outlines what you need to know to exploit fast data successfully.

Kafka streaming microservices with Akka Streams and Kafka Streams Tutorial

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Along the way, Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Hope Wang is a software engineer in Intuit’s Small Business Data and Analytics Group. Hope is a self-taught, self-motivated, fully powered hacker who is passionate about innovation. She holds a master’s degree in biomedical engineering from the University of Southern California.

Presentations

Machine learning platform lifecycle management Session

A machine learning platform is not just the sum of its parts; the key is how it supports the model lifecycle end to end. Hope Wang explains how to manage various artifacts and their associations, automate deployment to support the lifecycle of a model, and build a cohesive machine learning platform.

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

Running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Jason Wang, Eugene Fratkin, and Mael Ropars detail new paradigms to effectively run production-level pipelines with minimal operational overhead. Join in to learn how to remove barriers to data discovery, metadata sharing, and access control.

Matthew Ward is a commercial principal at ASI. He has more than seven years of experience advising energy and utility companies, helping to transform their customer engagement strategies using data solutions. Prior to joining ASI, Matthew was responsible for client success at Opower (acquired by Oracle in 2016), leading customer success, implementation engineering, and project management teams. He holds a bachelor's degree in climatology from McGill University and a master's degree in international energy policy from the Middlebury Institute.

Presentations

Data science for managers 2-Day Training

Jean Innes, Matthew Ward, Emanuele Haerens, and Alli Paget lead a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Data science for managers (Day 2) Training Day 2

The instructors offer a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.

Jim Webber is chief scientist at Neo Technology, where he works on next-generation solutions for massively scaling graph data. Previously, Jim was a professional services director with ThoughtWorks, where he worked on large-scale computing systems in finance and telecoms. Jim holds a PhD in computing science from Newcastle University in the UK.

Presentations

Mixing causal consistency and asynchronous replication for large Neo4j clusters Session

Jim Webber details how Neo4j mixes the strongly consistent Raft protocol with async log shipping and provides a strong consistency guarantee: causal, which means you can always at least read your writes even in very large multidata center clusters.
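The read-your-writes guarantee described above rests on bookmarks: a write returns a token marking its position in the replicated log, and a later read presents that token so a replica knows how far it must catch up before answering. The sketch below is a toy Python model of that idea under asynchronous log shipping; the Leader/Replica classes are illustrative assumptions, not Neo4j's actual API.

```python
class Leader:
    """Accepts writes and commits them, in order, to a log."""

    def __init__(self):
        self.log = []                  # committed (key, value) writes, in order

    def write(self, key, value):
        self.log.append((key, value))
        return len(self.log)           # the "bookmark": committed log position


class Replica:
    """A follower fed by asynchronous log shipping from the leader."""

    def __init__(self, leader):
        self.leader = leader
        self.applied = 0               # how far replication has caught up
        self.store = {}

    def replicate(self, upto):
        # Async log shipping: apply committed entries up to position `upto`.
        for key, value in self.leader.log[self.applied:upto]:
            self.store[key] = value
        self.applied = max(self.applied, upto)

    def read(self, key, bookmark=0):
        # Causal read: don't answer until this replica has seen the bookmark.
        if self.applied < bookmark:
            self.replicate(bookmark)   # a real cluster would wait for catch-up
        return self.store.get(key)
```

Without a bookmark, a read against a lagging replica can return stale data; presenting the bookmark from your own write forces the replica to catch up first, which is exactly the "at least read your writes" guarantee.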

Yiran Wu is a big data platform development engineer at JD.com, where he is mainly engaged in the construction and development of the company’s big data platform, using open source projects such as Hadoop, Spark, Hive, and Alluxio. He focuses on the big data ecosystem and is an open source developer, Alluxio contributor, and Hadoop contributor.

Presentations

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Session

Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, the JDPresto framework has seen a 10x performance improvement on average.

Tony Xing is a senior product manager on the AI, data, and infrastructure (AIDI) team within Microsoft’s AI and Research Organization. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service Group, where he worked on products for data ingestion, real-time data analytics, and the data quality platform.

Presentations

Bringing AI to BI: Microsoft's road to automated business incident monitoring and diagnostics with Project Kensho Session

Tony Xing and Bixiong Xu offer an overview of Project Kensho, Microsoft's one-stop shop for business incident monitoring and automated insights. Tony and Bixiong cover the technology's evolution, the architecture, the algorithms, and the benefits and the trade-offs. Along the way, they share a case study on Bing ads key metrics monitoring and automated diagnostic insights.

Bixiong Xu is the principal development manager on the AI, data, and infrastructure team at Microsoft.

Presentations

Bringing AI to BI: Microsoft's road to automated business incident monitoring and diagnostics with Project Kensho Session

Tony Xing and Bixiong Xu offer an overview of Project Kensho, Microsoft's one-stop shop for business incident monitoring and automated insights. Tony and Bixiong cover the technology's evolution, the architecture, the algorithms, and the benefits and the trade-offs. Along the way, they share a case study on Bing ads key metrics monitoring and automated diagnostic insights.

Fabian Yamaguchi is the chief scientist at ShiftLeft. Fabian has over 10 years of experience in the security domain, where he has worked as a security consultant and researcher focusing on manual and automated vulnerability discovery. He has identified previously unknown vulnerabilities in popular system components and applications such as the Microsoft Windows kernel, the Linux kernel, the Squid proxy server, and the VLC media player. Fabian is a frequent speaker at major industry conferences such as Black Hat USA, DEF CON, FIRST, and CCC and renowned academic security conferences such as ACSAC, Security and Privacy, and CCS. He holds a master’s degree in computer engineering from Technical University Berlin and a PhD in computer science from the University of Goettingen.

Presentations

Code Property Graph: A modern, queryable data storage for source code Session

Fabian Yamaguchi offers an overview of Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing it to be rapidly accessed.

Han Yang is a senior product manager at Cisco, where he drives UCS solutions for artificial intelligence and machine learning. He has always enjoyed driving technologies: previously, Han drove Cisco's big data and analytics UCS solutions and its largest switching beta, the Nexus 1000V software virtual switch. Han holds a PhD in electrical engineering from Stanford University.

Presentations

Incorporating data sources inside and outside of the data center (sponsored by Cisco) Session

Han Yang explains how Cisco is leveraging big data and analytics and details how the company is helping customers to incorporate data sources from the internet of things and deploy machine learning at the edge and at the enterprise.

Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Marton Balassi, Steven Totman, and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Wataru Yukawa is a data engineer at LINE, where he is creating and maintaining a log analysis platform based on Hadoop, Hive, Fluentd, Presto, and Azkaban and working on aggregating log and RDBMS data with Hive and reporting using BI tools.

Presentations

Batch and real-time processing in LINE's log analysis platform Session

LINE—one of the most popular messaging applications in Asia—offers many services, such as its news application. These services sometimes depend on real-time processing. Wataru Yukawa offers an overview of LINE's web tracking system, which consists of the JavaScript SDK, NGINX, Fluentd, Kafka, Elasticsearch, and Hadoop, and explains how it helps with batch and real-time processing.