Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Speakers

Experts and innovators from around the world share their insights and best practices. New speakers are added regularly. Please check back to see the latest updates to the agenda.


Nidhi is an entrepreneur who is passionate about building radical products. She co-founded the cloud configuration management startup qwikLABS, which was acquired by Google and remains the exclusive platform used by AWS customers and partners worldwide to create and deploy on-demand lab environments in the cloud. Most recently, Nidhi led product, strategy, marketing, and finance at the data integration company Tamr. Before founding qwikLABS, Nidhi worked at McKinsey & Company on big data and cloud strategy. Nidhi holds a PhD in computer science from the University of Wisconsin–Madison and holds six US patents. Follow Nidhi on the Medium publication Radical Product and on Twitter @aggarwalnidhi.

Presentations

Measure What Matters: How your measurement strategy can reduce OpEx Tutorial

These days it’s easy for companies to say, “We measure everything!” The problem is, most “popular” metrics may not be appropriate or relevant for your business. Measurement isn’t free and should be done strategically. This session covers how you can align measurement with your product strategy, so you can measure what matters for your business.

Alasdair Allan is a scientist and researcher who has authored over eighty peer-reviewed papers and eight books and has been involved with several standards bodies. Originally an astrophysicist, he now works as a consultant and journalist focusing on open hardware, machine learning, big data, and emerging technologies, with expertise in electronics (especially wireless devices and distributed sensor networks), mobile computing, and the “Internet of Things.” He runs a small consulting company and has written for Make: Magazine, Motherboard/VICE, Hackaday, Hackster.io, and the O’Reilly Radar. In the past he has mesh networked the Moscone Center, caused a U.S. Senate hearing, and contributed to the detection of what was, at the time, the most distant object yet discovered.

Presentations

Executive Briefing: Data Privacy in the age of the Internet of Things Session

The current age where privacy is no longer "a social norm" may not long survive the coming of the Internet of Things. Big data is all very well when it is harvested quietly and stealthily. But when your things tattle on you behind your back, it'll be a very different matter altogether. The rush to connect devices to the Internet has led to sloppy privacy controls. That can't continue.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Brian Arnold is the lead architect for the Data Historian Platform at Monsanto, responsible for guiding the platform’s technical direction and implementation. Brian has 10 years of experience as an IT professional, working on a large-scale ecommerce website and implementing various big data applications. Using numerous big data and cloud technologies, he has built recommendation system platforms as well as enterprise data lakes. While at Monsanto, he helped implement the company’s enterprise Kafka platform and now leads a team of engineers building Data Historian. Brian is passionate about big data technologies, the cloud, data science, and functional programming. He holds a BS in computer engineering with a minor in mathematics from Marquette University.

Presentations

You call it Data Lake, we call it Data Historian Session

The last few years have seen a number of tools appear on the market that make it easy to implement a data lake. However, most tools lack the essential features needed to prevent the data lake from turning into a data swamp. At Monsanto, our data platform engineering team embarked on building a platform that can ingest, store, and provide access to data sets without compromising ease of use, governance, or security.

David Asboth is a data scientist at Cox Automotive Data Solutions. He spent six years as a software developer before deciding to change careers, completing an MSc in data science to do so. His day job mostly involves creating value from messy and incomplete data.

Presentations

Scaling Data Science (Teams and Technologies) Session

Cox Automotive is the world’s largest automotive service organisation, and that means we can combine data from across the entire vehicle lifecycle. We are on a journey to turn this data into insights and want to share some of our experiences both in building up a data science team and scaling the data science process (from laptop to Hadoop cluster).

Eran is a senior software engineer in the Advanced Analytics department at Intel, where he enjoys everything distributed. From Spark and Kafka to Kubernetes and TensorFlow, he loves playing with them all. Eran holds an MS in computer science from the Hebrew University of Jerusalem.

Presentations

Real Time Deep Learning on Video Streams Session

Deep learning is revolutionizing many domains within computer vision; however, real-time analysis is challenging. To address this, we have constructed a novel architecture that enables real-time analysis of high-resolution streaming video. Our solution is a fully asynchronous system based on Redis, Docker, and TensorFlow that nonetheless gives the user the notion of a real-time video feed analysis.

Chinese. Focus on big data.

Presentations

Using Alluxio (formerly Tachyon) as a fault-tolerant pluggable optimization component in JD.com’s compute frameworks Session

JD.com uses Alluxio to support ad hoc and real-time stream computing. Among these workloads, JDPresto on Alluxio has delivered a 10x performance improvement on average. We use Alluxio’s HDFS-compatible URL and deploy it as a pluggable optimization component.

Jason Bell is a data engineer at Mastodon C, where he specialises in high-volume streaming systems, big data solutions, and machine learning applications.

He was section editor for Java Developer’s Journal, contributed to IBM developerWorks on Autonomic Computing and authored the book “Machine Learning: Hands on for Developers and Technical Professionals”.

Presentations

Learning how to design automatically updating AI with Apache Kafka and DeepLearning4J Session

Using Apache Kafka and DeepLearning4J, Jason Bell presents the design and implementation of a self-learning knowledge system, the design rationale behind it, and the implications of combining streaming data with deep learning and artificial intelligence.

Anya loves her position as Senior Member of Technical Staff (SRE) at Salesforce. She’s also a co-organizer of the SF Big Analytics meetup group, and is always looking for ways to make platforms more scalable / cost efficient / secure. Before Salesforce, Anya enjoyed contributing at Alpine Data where she focused on Spark Operations. The opinions expressed in this presentation do not reflect those of Anya’s employers, past or present.

Presentations

Understanding Spark Tuning with Auto Tuning (or magical spells to stop your pager going off at 2am*) Session

Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. This talk looks at auto-tuning jobs using both historical and live job information, with systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.

Albert Bifet is a big data scientist with 10+ years of international experience in research and in leading new open source software projects for business analytics, data mining, and machine learning (Huawei, Yahoo, University of Waikato, UPC). He obtained a Ph.D. from UPC-BarcelonaTech and has worked in Hong Kong, New Zealand, and Europe. At Yahoo Labs, he co-founded Apache SAMOA (Scalable Advanced Massive Online Analysis) in 2013. Apache SAMOA is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning group, he has co-led MOA (Massive Online Analysis) since 2008. MOA is the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of the book Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and was editor of the Big Data Mining special issue of SIGKDD Explorations in 2012. He served as co-chair of the industrial track of ECML PKDD 2015, of BigMine (2012–2017), and of the ACM SAC Data Streams Track (2012–2018).

Presentations

StreamDM: Advanced data science with Spark Streaming Session

We present StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei Noah’s Ark Lab and Telecom ParisTech.

Lee Blum – Common Technology Center, Verint Systems

Lee is a big data architect in the Verint Common Technology Center, responsible for designing big data solutions for large-scale cyber defense systems. In his role, Lee brings the latest big data technologies to provide rapid ingestion, processing, and advanced analytics of data collected by high-end cyber probes in internet service provider networks. With over 15 years of experience spanning network-oriented back-end development, big data architecture, and analytics, Lee works with the product management, research, and engineering teams to support the realization and implementation of advanced algorithms and data analytics in petabyte-scale data repositories.

Presentations

Creating the Ultimate Data Scientist’s Cyber Playground: Building a Multi-Petabyte Analytic Infrastructure for Cyber Defense Session

Using an actual complex case study, Lee Blum will share how we built our large-scale cyber defense system to serve our data scientists with versatile analytic operations on petabytes of data and trillions of records. He will discuss our extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system’s overall results.

Behzad Bordbar is a mathematician, software engineer, and big data technical instructor. He worked in academia for over 12 years, performing research and developing software, as well as serving as a visiting scientist at HP, BT, and IBM. He currently teaches courses on Hadoop, Hive, Impala, and Spark at Cloudera.

Presentations

Data science and machine learning with Apache Spark 2-Day Training

The instructor demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Mikio Braun is principal engineer for search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in the industry.

Presentations

Machine Learning for Time Series: What works and what doesn't Session

Time series data has many applications in industry, in particular predicting the future based on historical data. In this talk we review time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations on what works, what doesn’t, and in which contexts.

Keynote with Mikio Braun Keynote

Mikio Braun, Zalando SE

Andrew is the leading authority on the intersection between machine learning, regulation and law. Prior to joining Immuta, Andrew served as Special Advisor for Policy to the head of the FBI Cyber Division, where he served as lead author on the FBI’s after-action report for the 2014 attack on Sony. He is also a visiting fellow at Yale Law School’s Information Society Project, most recently participating in an expert discussion on how to structure a basic framework for governing machine learning and AI.

An author and former reporter, Andrew has been published in the Financial Times, The Los Angeles Times, Slate, The Yale Journal of International Affairs, and most recently, The New York Times for an article titled, “The End of Privacy.” His book, American Hysteria: The Untold Story of Mass Political Extremism in the United States (Lyons Press, 2015), was called “a must read book dealing with a topic few want to tackle” by Nobel laureate Archbishop Emeritus Desmond Tutu.

Andrew earned his J.D. from Yale Law School, and B.A. from McGill University. He is a term-member of the Council on Foreign Relations, a member of the Washington, D.C. and Virginia State Bars, and a Global Information Assurance Certified (GIAC) cyber incident response handler.

Presentations

How Will the GDPR Impact Machine Learning? Session

Strata Data London 2018 will take place during one of the most important weeks in the history of data regulation: the week GDPR enforcement begins. This talk explores the effects of the GDPR on deploying machine learning models in the EU.

Gianfranco is a data scientist, crowdsourcing specialist, and DataKind UK chapter lead. Di.Co.Im. is the consulting startup he founded to explore how best to bring people, tech, and data together.

Presentations

Executive Briefings: Killer robots and how not to do data science Session

Not a day goes by without reading headlines about the fear of AI or how technology seems to be dividing us more than bringing us together. Here at DataKind UK we're passionate about how machine learning and artificial intelligence can be used for social good. We'll talk about what socially conscious AI looks like, and what we're doing to make it a reality.

Simon Chan is a senior director of product management for Salesforce Einstein, where he oversees platform development and delivers products that empower everyone to build smarter apps with Salesforce.

Previously, Simon was the cofounder and CEO of PredictionIO, a leading open source machine learning server (acquired by Salesforce).

Simon is a product innovator and serial entrepreneur with more than 14 years of global technology management experience in London, Hong Kong, Guangzhou, Beijing, and the Bay Area. Simon holds a BSE in computer science from the University of Michigan, Ann Arbor, and a PhD in machine learning from University College London.

Presentations

The Journey of Machine Learning Platform Adoption in Enterprise Session

The promises of AI are great, but taking the steps to implement AI within an enterprise is challenging. The secret behind enterprise AI success often traces back to the underlying platform that accelerates AI development at scale. Based on years of experience helping executives establish AI product strategies, Dr. Simon Chan walks through the AI platform journey that is right for your business.

Jean-Luc Chatelain is a managing director for Accenture Digital and the CTO for Accenture Applied Intelligence. He focuses on helping Accenture customers become information-powered enterprises by architecting state-of-the-art big-data-to-outcomes solutions.

Prior to joining Accenture, Chatelain was the EVP of strategy and technology for DataDirect Networks Inc. (DDN), the world’s largest privately held big data storage company, with solutions deployed in the high-performance computing, intelligence and defense, media and entertainment, and cloud service provider markets. A DDN board member since 2007, he joined the company in February 2011 with several decades of experience as a technology industry leader. In that role he led DDN’s R&D efforts and was responsible for corporate and technology strategy.

Prior to DDN, Chatelain was a Hewlett-Packard Fellow, serving as VP and CTO of Information Optimization, where he was responsible for leading HP’s information management and business analytics strategy. Chatelain helped craft HP’s overall information management and business intelligence strategy through customer interaction and by serving as an ambassador to other HP constituencies such as IPG and HP Labs. His role involved numerous worldwide speaking engagements educating customers, analysts, and press on enterprise information governance and management, BI, and analytics, and he promoted HP’s holistic approach to managing information assets. He joined HP through its acquisition of Persist Technologies, a leader in hyper-scale grid storage and archiving solutions, where he was founder and CTO; Persist’s technology became the basis of the HP Information Archiving Platform (IAP).

Prior to HP, he was the CTO and Senior Vice President of Strategic Corporate Development for Zantaz, a leading service provider of information archiving solutions for the financial industry. There, he played an instrumental role in the development of the company’s services and raised millions of dollars in capital for international expansion.

Educated in Computer Science and Electrical Engineering in Paris, France, Chatelain pursued his business education at Emory University’s Goizueta Executive Business School. He is bilingual in French and English, completing additional studies in Russian and classical Greek.

Presentations

Executive Briefing: Becoming a data-driven enterprise—A maturity model Session

A data-driven enterprise maximizes the value of its data. But how do enterprises emerging from technology and organization silos get there? We use our experience helping our clients through this journey to create a data-driven enterprise maturity model that spans technology and business requirements. We will walk through use cases that bring the model to life.

Marty is the Director for Solution Architecture (EMEA) at Arundo. Prior to Arundo, he led software development at Europe’s biggest renewable energy company (Statkraft). His background is in software engineering, with a specialization in industrial applications.

In his spare time, Marty competes professionally around the world racing superbikes. He has also developed his own technical platform, which is used by top teams around Europe to develop riders’ skills.

Presentations

Real-Time Motorcycle Racing Optimization Session

In motorcycle racing, riders make snap decisions that determine outcomes spanning from success to grievous injury. Using a custom software-based edge agent and machine learning, we automate real-time maneuvering decisions in order to limit tire slip during races, thereby mitigating risk and enhancing competitive advantage.

Ira Cohen is a cofounder of Anodot and its chief data scientist, where he is responsible for developing and inventing its real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

The App Trap: Why Every Mobile App and Mobile Operator Needs Anomaly Detection Session

The mobile world has so many moving parts, so a simple change to one element can cause havoc somewhere else. Resulting issues can annoy users and cause revenue leaks. This presentation will discuss ways to use anomaly detection to track everything mobile, from the service and roaming to specific apps, to fully optimize your mobile offerings.

Darren Cook is technical director at QQ Trend, a financial data analysis and data products company. Darren has over 25 years of experience as a software developer, data analyst, and technical director and has worked on everything from financial trading systems to NLP, data visualization tools, and PR websites for some of the world’s largest brands. He is skilled in a wide range of computer languages, including R, C++, PHP, JavaScript, and Python. Darren is the author of two books, Data Push Apps with HTML5 SSE and Practical Machine Learning with H2O, both from O’Reilly, as well as a Coursera course on machine learning and H2O.

Presentations

Using LSTMs To Aid Professional Translators Session

Using LSTMs, state-of-the-art tokenizers, dictionaries, and other data sources to tackle machine translation, focusing on one of the most difficult language pairs: Japanese to English.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Paul Curtis is a principal solutions engineer at MapR, where he provides pre- and postsales technical support to MapR’s worldwide systems engineering team. Previously, Paul served as senior operations engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks, and advertisers; systems manager for Spiral Universe, a company providing school administration software as a service; senior support engineer positions at Sun Microsystems; enterprise account technical management positions for both Netscape and FileNet; and roles in application development at Applix, IBM Service Bureau, and Ticketron. Paul got started in the ancient personal computing days; he began his first full-time programming job on the day the IBM PC was introduced.

Presentations

Making Stateless Containers Reliable and Available Even with Stateful Applications Session

The flexibility advantage conferred by containers depends on their ephemeral nature, so it’s useful to keep containers stateless. However, many applications require state, and so need access to a scalable persistence layer that supports real mutable files, tables, and streams. This talk shows how to make containerized applications reliable, available, and performant, even when they are stateful.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Danielle Dean, PhD is a Principal Data Scientist Lead at Microsoft Corp. in the Algorithms and Data Science Group within the Artificial Intelligence & Research division. She currently leads an international team of data scientists and engineers to build predictive analytics and machine learning solutions with external companies utilizing Microsoft’s Cloud AI Platform. Before working at Microsoft, Danielle was a data scientist at Nokia, where she produced business value and insights from big data, through data mining & statistical modeling on data-driven projects that impacted a range of businesses, products and initiatives.

Danielle completed her Ph.D. in quantitative psychology with a concentration in biostatistics at the University of North Carolina at Chapel Hill, where she studied the application of multi-level event history models to understand the timing and processes leading to events between dyads within social networks.

Presentations

Executive Briefing: Lessons learned managing data science projects - Adopting a team data science process Session

This presentation covers the basics of managing data science projects, including the data science lifecycle and an overview of one example approach adopted internally at Microsoft, the "Team Data Science Process" (TDSP). Learn more about the typical priorities of data science teams and the keys to success in engaging and creating value with data science.

Baiju – Vice President, Enterprise Analytics

At Aviva Canada, Baiju leads a team of data scientists in application of analytics across all aspects of the insurance business: from core insurance activities such as pricing and risk selection to development of cutting edge robotic processes and application of machine learning algorithms to areas such as claims processing and optimizing client acquisition.

Prior to Aviva, Baiju led the analytics group at IIROC, an entity overseeing Canadian equity and fixed-income markets and ingesting 600M+ data points daily and in real time. The analytics group at IIROC guided decision making in today’s machine-driven (algorithmic) markets and embedded machine learning and other algorithmic surveillance capabilities for market oversight. In addition, Baiju spearheaded IIROC’s primary research into market-structure issues as well as emerging technologies such as blockchains.

Baiju was also part of the early team at OANDA, a fintech startup that disrupted retail foreign-exchange markets and attracted one of the largest funding rounds in Canada, $100M, from leading Silicon Valley venture firms. Baiju was responsible for building and leading the data engineering, data science, and growth teams at OANDA. He is also a founder of Fstream, a SaaS provider for ingesting and analyzing high-frequency streaming data.

Baiju has a BSc and MSc in computer science from Queen’s University and developed his data chops working on large biological datasets as part of his graduate work and later at the Ontario Cancer Institute.

Presentations

Risk sharing pools: winning zero-sum games through machine learning Session

Risk sharing pools allow insurers to get rid of risks they are forced to insure in highly regulated markets. Insurers thus cede both the risk and its premium. But are we ceding the right risks, or simply giving up premium? We present an applied machine learning talk that leverages an ensemble of models to gain a distinctive market advantage and win through machine learning.

Thomas W. Dinsmore is director of product marketing for Cloudera Data Science. Previously, he served as a knowledge expert on the strategic analytics team at the Boston Consulting Group; director of product management for Revolution Analytics; analytics solution architect at IBM Big Data Solutions; and a consultant at SAS, PricewaterhouseCoopers, and Oliver Wyman. Thomas has led or contributed to analytic solutions for more than five hundred clients across vertical markets and around the world, including AT&T, Banco Santander, Citibank, Dell, J.C.Penney, Monsanto, Morgan Stanley, Office Depot, Sony, Staples, United Health Group, UBS, and Vodafone. His international experience includes work for clients in the United States, Puerto Rico, Canada, Mexico, Venezuela, Brazil, Chile, the United Kingdom, Belgium, Spain, Italy, Turkey, Israel, Malaysia, and Singapore.

Presentations

A Roadmap for Open Data Science Session

Data science transforms the organization. But executives struggle to build a culture of open data science, and transition from legacy commercial analytic tools. There are clear best practices organizations can use to accelerate adoption and success with open data science. We propose a model that helps organizations begin the journey, build momentum, and reduce reliance on legacy software.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive Briefing: GDPR - Getting Your Data Ready for Heavy New EU Privacy Regulations Session

The General Data Protection Regulation (GDPR) goes into effect in May 2018 for firms doing any business in the EU. However, many companies aren't prepared for the strict regulation or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session explores the capabilities your data environment needs in order to simplify GDPR compliance, as well as future regulations.

Securing and governing hybrid, cloud and on-prem big data deployments: step-by-step Tutorial

Hybrid big data deployments present significant new security risks that need to be managed. It's incumbent upon security admins to ensure a consistently secured and governed experience for end users and administrators across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges.

Ted Dunning is chief applications architect at MapR Technologies. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects, and a mentor for various incubator projects. He also designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date. He holds a PhD in computing science from the University of Sheffield. He is on Twitter as @ted_dunning.

Presentations

Rendezvous with AI Session

No matter how clever your learning algorithms, two things will still be true: data and deployment logistics will dominate the effort, and you will need more than two versions of your model, even in full production. I will describe the rendezvous architecture and show how it addresses these issues and many more, allowing more time to be spent thinking and doing real data science.

Radhika is a product executive who has participated in 4 exits, two of which were companies she founded. The first startup she co-founded was Lobby7, a venture-backed company that created an early version of Siri back in 2000 and was acquired by Scansoft/Nuance. She later worked at Avid, growing their broadcast business by building a product suite to address pain points of broadcasters worldwide as they were moving from tape to digital media. She then led strategy at the telecom startup, Starent Networks, later acquired by Cisco for $2.9B. She left Cisco to start Likelii to offer consumers “Pandora for wine”. Likelii was later acquired by Drync. Recently she led Product Management at Allant to build a SaaS product for TV advertising. Allant’s TV division was subsequently acquired by Acxiom. Too long ago to admit, Radhika graduated from MIT with an SB and M.Eng in Electrical Engineering, and speaks 9 languages. You can follow her on the Medium publication, Radical Product and on Twitter @radhikadutt.

Presentations

Measure What Matters: How your measurement strategy can reduce OpEx Tutorial

These days it’s easy for companies to say, "We measure everything!” The problem is, most “popular” metrics may not be appropriate or relevant for your business. Measurement isn’t free, and should be done strategically. This session covers how you can align measurement with your product strategy, so you can measure what matters for your business.

Erik Elgersma is Director of Strategic Analysis at FrieslandCampina, one of the world's largest dairy companies with €12 billion in sales. He has worked at FrieslandCampina since 1999, holding positions in corporate strategy, business development, and innovation management, and has led the company's strategic analysis practice for 18 years. Based in the Netherlands, he speaks and lectures frequently on strategic analysis, data management, and competitive intelligence.
Previous engagements include:
• Royal Dutch Association of Information Professionals
• The US Navy, Centre for Naval Analysis
• Cimi.Con Annual Conference
• Institute for Competitive Intelligence, Annual Conference
• IAFIE-Europe Conference
• Brunel University, London
Erik is available for guest lectures in business, as well as speaking engagements and conferences, on competitive advantage and intelligence, data science, strategic analysis, and war-gaming.

Presentations

Predictive analytics for FMCG business strategies and tactics Session

How a global food company uses multiple sources of monthly updated data and smart algorithms to predict future commodity prices as input for its commercial policies.

Nick Elprin is the CEO and co-founder of Domino Data Lab, a data science platform that accelerates the development and deployment of models while enabling best practices like collaboration and reproducibility. Before starting Domino, Nick built tools for quantitative researchers at Bridgewater, one of the world's largest hedge funds. He has over a decade of experience working with data scientists at advanced enterprises. He has a BA and MS in computer science from Harvard.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending: accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. You'll learn how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Olga Ermolin is a Senior Business Intelligence Engineer at MLS Listings, Inc. She is responsible for standardizing the real-estate database schema across multiple real-estate hosting companies, as well as for maintaining day-to-day data integrity and scalability. She is behind the company's BI product, which enables clients to visualize and analyze real-estate trends and performance.

Presentations

Using Siamese CNNs for Removing Duplicate Entries From Real-Estate Listing Databases Session

Aggregation of geo-specific real-estate databases results in duplicate entries for properties located near geographical boundaries. We present an approach to identifying duplicate entries via analysis of the images that accompany real-estate listings, leveraging transfer learning with a Siamese architecture based on the VGG-16 CNN topology.
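As a rough sketch of the kind of model this session describes (the layer choices and training details below are illustrative assumptions, not the presenters' implementation), a Siamese network with a shared VGG-16 backbone can be set up in Keras like this:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_siamese(input_shape=(224, 224, 3)):
    # Shared VGG-16 backbone; weights=None avoids a download here,
    # but transfer learning would use weights="imagenet".
    backbone = tf.keras.applications.VGG16(
        include_top=False, weights=None,
        input_shape=input_shape, pooling="avg")

    img_a = layers.Input(shape=input_shape)
    img_b = layers.Input(shape=input_shape)
    emb_a = backbone(img_a)  # the same weights embed both photos
    emb_b = backbone(img_b)

    # Element-wise distance between embeddings, then a sigmoid score:
    # close to 1 means the two listing photos likely show the same property.
    dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
    same = layers.Dense(1, activation="sigmoid")(dist)
    return Model(inputs=[img_a, img_b], outputs=same)

model = build_siamese()
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Training would then use pairs of listing photos labeled duplicate/not-duplicate; the shared backbone is what makes the architecture "Siamese".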

Javier “Xavi” Esplugas is the vice president of IT planning and architecture at DHL Supply Chain. Xavi has served in a number of roles at DHL Supply Chain. Previously, he drove the standardization and innovation agenda in Europe, which included DHL’s vision picking, robotics, and the internet of things. Xavi holds an MSC in computer engineering from Universitat Politècnica de Catalunya in Barcelona.

Presentations

IoT improves center of activity: DHL increases efficiency and reduces distance traveled across the warehouse Session

As an essential component of DHL’s IoT initiative, Conduce visualizations track and analyze distance traveled by personnel and warehouse equipment, all calibrated around a center of activity. Using immersive operational data visualization, DHL is gaining unprecedented insight to evaluate and act on everything that occurs in its warehouses.

Moty Fania owns development and architecture in the advanced analytics group within Intel IT. With over 13 years of experience in analytics, data warehousing, and decision support solutions, Moty drives the overall technology and architectural roadmap for big data analytics in Intel IT. Moty is also the architect behind Intel’s IoT big data analytics platform. He holds a bachelor’s degree in computer science and economics and a master’s degree in business administration from Ben-Gurion University in Israel.

Presentations

A high-performance system for deep learning inference and visual inspection Session

In this session, Moty Fania will share Intel IT's experience implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time streaming and online actuation. This session highlights the key learnings from this work, with a thorough review of the platform's architecture.

Danyel Fisher is a Senior Researcher in information visualization and human-computer interaction at Microsoft Research’s VIBE group. His research focuses on ways to help users interact with data more easily. His recent work studies big data, sequential data, and end-user visualization. Danyel received his MS from UC Berkeley, and his PhD from UC Irvine.

Presentations

Making Data Visual: A Practical Session on Using Visualization for Insight Tutorial

How do you derive insight from data? This session teaches the human side of data analysis and visualization. We'll discuss operationalization, the process of reducing vague problems to specific tasks, and how to choose a visual representation that addresses those tasks. We'll also discuss single views and how to link them into multiple views.

I manage large chunks of The Data Lab, including our Project Development, Contracting and HR teams, and am responsible for all aspects of The Data Lab’s sponsored project delivery. I lead a range of strategic projects for The Data Lab, including the Cancer Innovation Challenge.

Presentations

How can data help treat cancer? Lessons from Scotland's Cancer Innovation Challenge Session

Scotland has some of the world's best cancer data and some of the world's best data scientists and data companies. But what Scotland doesn't have is very good cancer outcomes. So how can we use data, our skills, and our networks to deliver better cancer treatments and results? This is what we've learned from two years delivering Scotland's Cancer Innovation Challenge.

Jeff graduated from Witwatersrand University with a degree in electrical engineering and has been involved in Internet technology all his professional life, but with a strong commercial bent. He started his career at Telkom in 1994, working on the initial Internet infrastructure team and managing aspects of the Johannesburg Beltel installation. Between stints at Sprint (which became UUNET, which became Verizon Business), where he designed and implemented new Internet products and services, Jeff founded Antfarm Networking Technologies, South Africa’s first streaming and webcasting company. He returned to the corporate world in 2004, occupying the corner office of the new product development team at Internet Solutions (then IS). Jeff now works as a Systems Engineer for Cloudera, helping customers build big data infrastructure. In 2012 he founded www.limn.co.za, a blog dedicated to the art of data visualisation, and gives occasional presentations for people looking to move beyond pie charts. He was shortlisted for an Information is Beautiful award in 2015.

Presentations

Data Visualisation in a Big Data World Session

As big data adoption grows, Apache Hadoop, Apache Spark and machine learning technologies are increasingly being used to analyse ever larger datasets. But we still have to keep telling stories about the data and making sure the message is clear. This talk will cover the tools and techniques that are relevant to data visualisation practitioners working with large data sets and predictive models.

Eugene Fratkin is a director of engineering at Cloudera, leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

Running Data Analytic Workloads on the Cloud Tutorial

The cloud offers an alternative to single, multi-purpose clusters: hyperscale storage decoupled from elastic, on-demand computing. Join us to discuss new paradigms for effectively running production-level pipelines with minimal operational overhead, removing barriers to data discovery, metadata sharing, and access control.

Michael Freeman is a Lecturer at the University of Washington Information School where he teaches courses in data science, interactive data visualization, and web development. Prior to his teaching career, he worked as a data visualization specialist and research fellow at the Institute for Health Metrics and Evaluation. There, he performed quantitative global health research and built a variety of interactive visualization systems to help researchers and the public explore global health trends.

Michael is interested in applications of data visualization to social change, and holds a Master’s in Public Health from the University of Washington. See more of Michael’s work on his personal website.

Presentations

Visually Communicating Statistical and Machine Learning Methods Session

Statistical and machine learning techniques are only useful when they're understood by decision makers. While implementing these techniques is easier than ever, communicating about their assumptions and mechanics is not. In this session, participants will learn a design process for crafting visual explanations of analytical techniques and communicating them to stakeholders.

Barbara is a Data Scientist with a strong software development background. Working with a variety of different companies, she gained experience building diverse software systems. This experience brought her focus to the data science and big data field. She believes in the importance of data and metrics when growing a successful business.

Alongside collaborating on data architectures, Barbara still enjoys programming. She currently speaks at conferences in between working in London. She tweets at @BasiaFusinska and blogs at http://barbarafusinska.com.

Presentations

Introduction to Natural Language Processing with Python Tutorial

Natural language processing techniques allow us to address tasks such as text classification, information extraction, and content generation. In this session, Barbara will walk the audience through the process of building a bag-of-words representation and using it for text classification. The goal of this tutorial is to build intuition for simple natural language processing tasks.

Aurélien Géron is a machine-learning consultant. Previously, he led YouTube’s video classification team and was founder and CTO of two successful companies (a telco operator and a strategy firm). Aurélien is the author of several technical books, including the O’Reilly book Hands-on Machine Learning with Scikit-Learn and TensorFlow.

Presentations

Deep computer vision for manufacturing Session

Convolutional neural networks (CNNs) are now capable of superhuman-level performance in many computer vision tasks. This will have a large impact on manufacturing, improving anomaly detection, product classification, analytics, and more. In this talk, I will present the main CNN architectures, how they can be applied to manufacturing, and the challenges they will face.

Naveed Ghaffar is a serial entrepreneur and innovator in data management and data science technologies. He has founded three start-ups in this domain over the past six years, and he continues to coach and mentor a number of London-based start-ups.

Naveed is the co-founder of Narrative Economics, a start-up company that is leading research and development in the emerging field of Natural Language Understanding as it pertains to the spread of popular narratives across the world.

More recently, Naveed served as Chief Engineer for KPMG McLaren from January 2016 to September 2017, where he was responsible for overall product management for the company’s suite of data analytics and simulation solutions.

Naveed has a background in data analytics and data governance. He is recognised as a thought leader on Privacy by Design, Design Thinking and the policy and technical effects of the EU GDPR regulations.

Naveed holds a degree in Law (LLB Honors) and a Masters in Computer Science (MSc) from the University of Birmingham. He is a Certified Scrum Master, Design Thinking Coach, and Jobs-to-be-Done Innovator.

Presentations

Narrative Extraction – Analysing the World’s Narratives through Natural Language Understanding Session

Narratives are significant vectors of rapid change in culture, in zeitgeist, in economic behaviour. Introduced formally by Professor Robert Shiller in 2017, Narrative Economics studies the impact of popular human-interest stories on economic fluctuations. We present a framework that uses Natural Language Understanding for extracting and analysing narratives in human communication.

A former Data Science Consultant who has worked with several publishers in the UK and US.

Presentations

Revolutionising the newsroom with artificial intelligence. Session

In the era of 24-hour news and online newspapers, editors in the newsroom must be able to make fast decisions about their content and must quickly and efficiently make sense of the enormous amounts of data that they encounter. We will discuss an ongoing partnership between News UK and Pivigo in which a team of data science trainees helped develop an AI platform to help in this task.

Sean is a Software Engineer on the Fast Data Platform team at Lightbend where he specializes in Apache Kafka and its ecosystem. The Fast Data Platform team is building the next generation big data platform distribution with a focus on stream processors, machine learning, and operations ease of use. Sean has several years experience consulting for Global 5000 companies and helping them build data streaming platforms using technologies such as Kafka, Spark, and Akka.

Presentations

Kafka in jail. Running Kafka in container orchestrated clusters. Session

Kafka is best suited to running close to the metal on dedicated machines in statically defined clusters. What are the pros and cons of running containerized Kafka in the age of mixed-use clusters? Learn about techniques for running Kafka while also supporting service migration in shared resource environments such as DC/OS (Mesos) and Kubernetes.

I am currently a researcher at Télécom ParisTech. My main research area is machine learning, especially evolving data streams, concept drift, ensemble methods, and big data streams. I co-lead the StreamDM open source data stream mining project.

Presentations

StreamDM: Advanced data science with Spark Streaming Session

We present StreamDM, an open source real-time analytics software library built on top of Spark Streaming, developed at Huawei's Noah’s Ark Lab and Télécom ParisTech.

Miguel González-Fierro is a Senior Data Scientist at Microsoft UK, where he helps customers improve their processes using big data and machine learning. Previously, he was CEO and founder of Samsamia Technologies, a company that created a visual search engine for fashion items, allowing users to find products using images instead of words, and founder of the Robotics Society of Universidad Carlos III, which developed projects related to UAVs, mobile robots, small humanoid competitions, and 3D printers. Miguel also worked as a robotics scientist at Universidad Carlos III of Madrid and King’s College London, where his research focused on learning from demonstration, reinforcement learning, computer vision, and dynamic control of humanoid robots. He holds a BSc and MSc in electrical engineering and an MSc and PhD in robotics.

Presentations

Distributed Training of Deep Learning Models Session

In this talk we will present two platforms for running distributed deep learning training in the cloud. We will train a ResNet network on the ImageNet dataset using some of the most popular deep learning frameworks, then compare and contrast the performance improvements as we scale the number of nodes, as well as provide tips and details on the pitfalls of each framework and platform.

I am passionate about innovation and about transforming business value and customer needs into new technical solutions. I combine deep technical expertise with careful planning and leadership. I feel fully satisfied when I see successful results applied to real customers, and I love taking responsibility and leading groups to succeed towards a common goal.

I am open and flexible to take different roles and adapt to different working methodologies.

Presentations

Big Data, Big Quality: Data Quality at Spotify Session

I will walk you through Spotify's approach to data quality: from why and how we became aware of its importance to the main products we have developed. The presentation will conclude with details on where Spotify is now and on our future strategy.

Martin Goodson is the chief scientist and CEO of Evolution AI, where he specializes in large-scale natural language processing. Martin has designed data science products that are in use at companies like Dun&Bradstreet, Time Inc., John Lewis and Condé Nast. Previously, Martin worked as a statistician at the University of Oxford, where he conducted research on statistical matching problems for DNA sequences.

Presentations

On the Limits of Decision-Making with Artificial Intelligence Session

How can AI become part of our business processes? Should we entrust critical decisions to completely autonomous systems? I’ll illustrate how to increase confidence in AI systems and manage the transition to an AI-driven organisation. Examples will be drawn from projects running in enterprise businesses and UK government agencies.

Charaka Goonatilake is CTO at Panaseer, where he has designed and delivered big data solutions that give Chief Information Security Officers and their teams visibility into the true state of security within their business, in order to improve cyber hygiene and reduce cyber risk exposure. He has been immersed in big data technologies since the very early days of Hadoop, giving him hands-on experience of making Hadoop work in the enterprise to produce data-driven insights. Over the past eight years, across Panaseer and BAE Systems Applied Intelligence, Charaka has architected and engineered Hadoop-based data platforms for a range of cyber security use cases, from security analytics for threat detection to threat intelligence management and cyber security risk management.

Presentations

Architecting Data Platforms for Cyber Security Session

With cyber security becoming a top priority for all modern digital businesses, data is becoming a crucial weapon to secure an organisation against cyber threats. We'll explore strategies for designing effective data platforms for cyber security using big data technologies, such as Spark and Hadoop, and discover how these platforms are being used in real-world examples of data-driven security.

Richard came to big data as a user, running data-intensive departments (credit portfolio management, then performance management and compensation) for Canada’s largest bank. Since 2015 he has worked with IMC to bring companies AI solutions that change the way they think about and use their big data.

Before all this he worked in consulting with BCG and in financial services in the UK and Canada. Richard has degrees from the University of Oxford and INSEAD.

Presentations

Blind Men & Elephants: What’s Missing from Your Big Data? Session

Analytics using big data tends to focus on what is easily available, which is, by and large, data about what has already happened. The implicit assumption is that past behaviour will predict future behaviour. In this presentation we will show how organizations already possess data they aren’t exploiting that, with the right tools, can be used to develop far more powerful predictive algorithms.

Matthias Graunitz (AUDI AG) works as an Architect at Audi's Competence Center for Big Data & Business Intelligence. AUDI AG is a German automobile manufacturer that designs, engineers, produces, markets, and distributes luxury vehicles. Audi is a member of the Volkswagen Group and has its roots in Ingolstadt, Bavaria, Germany; Audi-branded vehicles are produced in nine production facilities worldwide. Matthias has more than 10 years' experience in the field of business intelligence and big data. He is responsible for the architectural framework of the Hadoop ecosystem and a separate Kafka cluster, as well as for the data science toolkits provided by the Competence Center for all business departments at Audi.

Presentations

Audi's journey to an enterprise big data platform Session

This talk is about Audi's journey from a first Hadoop PoC to a multi-tenant enterprise platform. We share the experiences gained on that journey, explain the decisions we had to make, and show how some use cases are implemented using our platform.

Jacinta Greenwell works at the Australian Treasury where she has provided advice on a wide range of economic policy issues including disability and social inclusion, government budgets, demographics and the labour market. Jacinta is honing her data science skills and loving every minute of it. She has a particular interest in creating data visualisation products which she believes can bring economic and policy stories to life. Jacinta is keen to promote the visibility of women in mathematics, economic modelling and data science.

Presentations

Leveraging public-private partnerships using data analytics for economic insights Session

In October 2017, LinkedIn and the Australian Treasury teamed up to gain a deeper understanding of the Australian labour market through new data insights which may inform economic policy and directly benefit society. This presentation shares some of the discoveries, together with the practicalities of working in a public-private partnership.

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Big Data at Speed Session

There are a lot of details that go into building a big data system for speed: what is a respectable latency for data access, how to solve the multi-region problem, where to store the data, how to know what data you have, and where stream processing fits in. In this session, we walk through our experiences and lessons learned from seeing implementations in the wild.

Democratizing data within your organization Session

Sure, you’ve got the best and fastest SQL engine, but you’ve still got problems: users don’t know which tables exist or what they contain. Sometimes bad things happen to your data and you need to regenerate some partitions, but there is no tool to do so. This talk goes into detail on how to make your team and the larger organization more productive when it comes to consuming data.

Sijie Guo is the cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for messaging group at Twitter, where he cocreated Apache DistributedLog, and worked on push notification infrastructure at Yahoo. He is the PMC chair of Apache BookKeeper.

Presentations

Modern real-time streaming architectures Tutorial

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. In this tutorial, we present an in-depth overview of state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him on Twitter: @YufengG.

Presentations

Getting up and running with TensorFlow Tutorial

In this session, you will learn how to train a machine-learning system using TensorFlow, a popular open source ML library. Starting from conceptual overviews, we will build all the way up to complex classifiers. You’ll gain insight into deep learning and how it can apply to complex problems in science and industry.
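As a taste of the sort of starting point such a tutorial covers (the toy dataset and network shape here are my own illustrative assumptions, not the session's material), a minimal TensorFlow/Keras classifier can be trained in a few lines:

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy task: label a 2D point 1 if x + y > 1, else 0
rng = np.random.default_rng(0)
X = rng.random((1000, 2)).astype("float32")
y = (X.sum(axis=1) > 1.0).astype("int32")

# A small feed-forward classifier
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=20, verbose=0)

loss, acc = model.evaluate(X, y, verbose=0)
print(f"training accuracy: {acc:.2f}")
```

More complex classifiers follow the same pattern: define the model, compile with a loss and optimizer, then fit on data.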

Tech. Data. Housing. Local Gov. Delivery.

Presentations

Predicting rent arrears: leveraging data science in the public sector Session

Social housing provides secure, low-cost housing options to those most in need. One major challenge to this programme is determining how best to target interventions when tenants fall behind on rent payments. We will discuss a recent project in which a team of data scientist trainees helped Hackney Council devise a more efficient, targeted strategy to detect and prioritise such situations.

Phil Harvey is passionate about data and people. Empathy is the key data skill. He’s also a big beardy geek.

Presentations

Successful Data Cultures: Inclusivity, empathy, retention, and results Session

Our lives are being transformed by data. Our work, our play and our health are now understood in new ways. Every organisation can take advantage of this resource. But something is holding us back; us! This talk discusses how to build a successful data culture. To embed data at the heart of every organization through people. How empathy, communication and humanity delivers success.

Kaylea Haynes is a Data Scientist at Peak, a Manchester-based data analytics service that helps companies grow revenue and profits using data and machine learning. Since working at Peak, Kaylea has focused on developing techniques for demand forecasting. Kaylea has a PhD in Statistics and Operational Research from Lancaster University, with her thesis “Detecting Abrupt Changes in Big Data.” Kaylea is also a member of the Royal Statistical Society and co-organises R Ladies Manchester.

Presentations

The “Ins and Outs” of forecasting in a hire business Session

Deciding how much stock to hold is a challenge for hire businesses. There is a fine balance between holding enough stock to fulfil hires and not holding too much stock so that overall utilisation is too low to achieve the return on investment. In this talk, we will describe a case study that we worked on involving forecasting the demand for thousands of assets across multiple locations.

Olaf Hein is a Principal Consultant and Team Lead at ORDIX AG.

With more than 20 years in the IT industry, Olaf has gained experience as an architect, developer, administrator, trainer, and project manager in many different areas. Storing and processing huge amounts of data has always been a focal point of his work. At ORDIX AG, he is responsible for big data technologies and solutions. He has built up a powerful team of big data consultants, created several training courses, speaks at conferences, and regularly publishes technical articles.

Presentations

Fast analytics on fast data - Kudu as storage layer for banking applications Session

To speed up business processes, a large German bank relies mainly on a Kudu-based data platform. This session highlights the key data access patterns and the architecture of the system. After nearly a year in production, Olaf Hein will also share his experiences with Kudu in development and operations.

Software Engineer at Naver Corp since 2007.

Presentations

Analyzing 100B rows in real time on Druid @naver.com Session

Naver.com, the largest search engine in Korea with over 70% market share, serves several billion page views per day. My team successfully built a real-time MOLAP system using Druid over 100B rows. In this presentation, we'll present our goals, the problems encountered during development, our techniques for speed-up, how we extended Druid, and our architecture.

Carsten works as a Big Data Architect at Audi Business Innovation GmbH, a subsidiary of Audi and a small company focused on developing new mobility services as well as innovative IT solutions for Audi. Carsten has more than 10 years' experience delivering data warehouse and BI solutions to his customers. He started working with Hadoop in 2013 and has since focused on both big data infrastructure and solutions. Currently Carsten is helping Audi build up their big data platform based on Hadoop and Kafka. Further, as a solution architect he is responsible for developing and running the first analytical applications on that platform.

Presentations

Audi's journey to an enterprise big data platform Session

This talk is about Audi's journey from a first Hadoop PoC to a multi-tenant enterprise platform. We share the experiences gained on that journey, explain the decisions we had to make, and show how some use cases are implemented using our platform.

Christine Hung leads the data solutions team at Spotify, which collaborates with business groups across the company to build scalable analytics solutions and provide strategic business insights. Previously, Christine ran the data science and engineering team at the New York Times, where her team partnered closely with the newsroom to build audience development tools and predictive algorithms to drive performance; she was also head of sales analytics for iTunes at Apple and a business analyst at McKinsey & Company. Christine grew up in Taiwan and currently lives in Manhattan with her family. She holds an MBA from Stanford Business School.

Presentations

Keynote with Christine Hung Keynote

Christine Hung, Spotify

Paul Ibberson has worked with Teradata technology for over 20 years, covering most aspects of building data warehouses and the broader analytical ecosystem.

He has performed many roles at Teradata, supporting pre- and post-sales activities in the Financial Services, Utilities, Oil & Gas, and Government industries and leading delivery teams in greenfield customer implementations, from shaping requirements through design, build, and implementation into service.

As a current member of the Teradata International Ecosystem Architecture COE, he helps leading organizations get the most value out of the latest advances in the data warehousing and big data landscape. He has presented at technical and customer conferences around the globe and has jointly presented with a customer at the Teradata Partners conference.

Paul's experience includes working with clients across the international region, including the Nordic countries, Benelux, Germany, Austria, Australia, and South Africa, as well as the UK.

Paul holds a BSc degree in Computer Science and Engineering from Manchester Metropolitan University, UK. He was awarded the Teradata Consulting Excellence Award in 2013.

Presentations

Driving better predictions in the Oil and Gas Industry with modern data architecture Session

Oil exploration and production is technically challenging, and exploiting the associated data (petabyte seismic surveys, real-time sensor data, 3D models of the subsurface) brings its own challenges too. Oil companies are now looking to big data analytics to help them make better predictions, but how do we marry the specialist functionality of the old siloed systems with a new big data world?

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Solving data cleaning and unification using human-guided machine learning Session

Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details of configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics can collectively deliver a scalable, high-accuracy solution.

Rashed Iqbal, PhD, is a program manager for data solutions at Teledyne Technologies in California. He has two areas of interest and expertise: data science and machine learning, and transitioning traditional organisations to Agile and Lean methods. He practices, consults, and teaches in these domains.

He is an adjunct professor in the Economics Department at UCLA, where he teaches graduate courses in data science. Rashed also teaches deep learning and natural language processing/understanding at UC Irvine.

Rashed has undertaken multiple entrepreneurial ventures in these areas. His current area of research is narrative economics, which studies the impact of popular narratives and stories on economic fluctuations. He is using natural language processing/understanding and deep learning to extract narratives in human communication. He believes narrative extraction will revolutionize the study of human communication, of which narrative economics is just one application.

Rashed holds a PhD in systems engineering with a focus on stochastic and predictive systems, along with current CSM, CSP, PMI-ACP, and PMP certifications. He is also a senior member of the IEEE.

Presentations

Narrative Extraction – Analysing the World’s Narratives through Natural Language Understanding Session

Narratives are significant vectors of rapid change in culture, in zeitgeist, in economic behaviour. Introduced formally by Professor Robert Shiller in 2017, Narrative Economics studies the impact of popular human-interest stories on economic fluctuations. We present a framework that uses Natural Language Understanding for extracting and analysing narratives in human communication.

Kinnary Jangla is a senior software engineer on the homefeed team at Pinterest, where she works on the machine learning infrastructure team as a backend engineer. Kinnary has worked in the industry for 10+ years. Previously, she worked on maps and international growth at Uber and on Bing search at Microsoft. Kinnary holds an MS in computer science from the University of Illinois and a BE from the University of Mumbai. Additionally, Kinnary is the author of two published books.

Presentations

Accelerating development velocity of production ML systems with Docker Session

Having trouble coordinating development of your production ML system across a team of developers? Microservices drifting apart and making debugging painful? Come learn how Pinterest Dockerized the services powering its home feed and how doing so improved the engineering productivity of our ML teams while increasing uptime and ease of deployment.

Flavio Junqueira leads the Pravega team at DellEMC. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, he held an engineering position with Confluent and research positions with Yahoo Research and Microsoft Research. Flavio is a contributor to Apache projects such as Apache ZooKeeper (PMC and committer), Apache BookKeeper (PMC and committer), and Apache Kafka, and he coauthored the O’Reilly ZooKeeper book. Flavio holds a PhD degree in computer science from the University of California, San Diego.

Presentations

Stream scaling in Pravega Session

Stream processing is in the spotlight. Enabling low-latency insights and actions from continuously generated data is compelling in a number of application domains. Critical to many such applications is the ability to adapt to workload variations, e.g., daily cycles. Pravega is a stream store that scales streams automatically and enables applications to scale downstream by signaling changes.

Eva Kaili is a Member of the European Parliament, head of the Hellenic delegation for the Progressive Alliance of S&D, chair of the Scientific Foresight Unit (STOA), and chair of the EU-NATO delegation.

Presentations

Keynote with Eva Kaili Keynote

Eva Kaili, Member of the European Parliament, Head of the Hellenic Delegation for the Progressive Alliance of S&D, Chair Scientific Foresight Unit #STOA

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. At narrativeVIZ Consulting, Amit uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Amit also teaches storytelling with data as guest faculty in executive courses at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 12 years of management consulting experience with AT Kearney in India, Booz & Company in Europe, and more recently for startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad. Find more about him at Amitkaps.com.

Presentations

Architectural Design for Interactive Visualization Session

Visualisation for data science requires an interactive visualisation setup that works at scale. In this talk, we will explore the key architectural design considerations for such a system and illustrate, using real-life examples, the four key tradeoffs in this design space: rendering for data scale, computation for interaction speed, adaptivity to data complexity, and responsiveness to data velocity.

Deep Learning in the Browser: Explorable Explanations, Model Inference & Rapid Prototyping Session

We showcase three live demos of doing deep learning (DL) in the browser using emerging client-side JavaScript libraries for DL: building explorable explanations to aid insight, building model inference applications, and even rapid prototyping and training of ML models.

I am currently leading the NLP & Data Science practice at Episource, a US healthcare company. My daily work revolves around working on semantic technologies and computational linguistics (NLP), building algorithms and machine learning models, researching data science journals and architecting secure product backends in the cloud.

The tech stack my team and I typically work on includes:

Language: Python
Testing Frameworks: unittest, pytest
Automation & Configuration Management: Ansible, Docker, Vagrant
CI: Travis CI
Cloud Services: AWS, Google Cloud, MS Azure
APIs: Bottle, CherryPy, Flask
Databases: MySQL, SQLite, MSSQL, RDF stores, Neo4J, ElasticSearch, MongoDB, Redis
Editors: Sublime Text, PyCharm

I have architected multiple commercial NLP solutions in healthcare, food & beverage, finance, and retail. I am deeply involved in functionally architecting large-scale business process automation and deep insights from structured and unstructured data using natural language processing and machine learning. I have contributed to multiple NLP libraries, including Gensim and Conceptnet5. I blog regularly on NLP on multiple forums, including Data Science Central, LinkedIn, and my blog Unlock Text.

I love teaching and mentoring students. I speak regularly on NLP and text analytics at conferences and meetups like PyCon India and PyData. I have also taught multiple hands-on sessions at IIM Lucknow and MDI Gurgaon and have mentored students from schools like ISB Hyderabad, BITS Pilani, and the Madras School of Economics. When bored, I like to fall back on Asimov to lead me into an alternate reality.

Presentations

Building a Healthcare Decision Support System for ICD10/HCC Coding through Deep Learning Session

At Episource, we build deep learning frameworks and architectures to help summarize a medical chart and extract medical coding opportunities and their dependencies in order to recommend the best possible ICD10 codes. This required not only building a wide variety of deep learning algorithms to account for natural language variations but also fairly complex in-house training data creation exercises.

Holden is a trans Canadian open source developer advocate with a focus on Apache Beam, Spark, and related “big data” tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. Prior to joining Google as a Developer Advocate she worked at IBM, Alpine, Databricks, Google (yes this is her second time), Foursquare, and Amazon. She was tricked into the world of big data while trying to improve recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.

Presentations

Understanding Spark Tuning with Auto Tuning (or magical spells to stop your pager going off at 2am*) Session

Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. This talk will look at auto-tuning jobs using both historical and live job information, using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
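For context on what those "magic numbers" look like in practice (this sketch is illustrative and not taken from the talk): a common rule-of-thumb baseline sizes executors from node specs, reserving resources for the OS and daemons. Auto-tuning aims to replace arithmetic like this with settings learned from job telemetry.

```python
def suggest_executor_settings(nodes, cores_per_node, mem_per_node_gb):
    """Rule-of-thumb starting point for Spark executor sizing (illustrative only)."""
    # Reserve one core and 1 GB per node for the OS and cluster daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_per_node_gb - 1
    # Around 5 cores per executor is a commonly cited sweet spot for HDFS throughput.
    cores_per_executor = min(5, usable_cores)
    executors_per_node = usable_cores // cores_per_executor
    # Leave one executor slot for the application driver.
    num_executors = nodes * executors_per_node - 1
    # Split node memory across executors, then shave ~7% for off-heap overhead.
    mem_per_executor_gb = usable_mem_gb // executors_per_node
    executor_memory_gb = int(mem_per_executor_gb * 0.93)
    return {
        "spark.executor.instances": num_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{executor_memory_gb}g",
    }

conf = suggest_executor_settings(nodes=10, cores_per_node=16, mem_per_node_gb=64)
print(conf)
```

Static heuristics like this ignore the actual workload, which is exactly the gap historical and live job information can close.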

Ilia is a data scientist working on applying ML and deep learning solutions in industry. He is particularly interested in the statistical theory behind deep learning. Ilia holds an MSc in economics from the London School of Economics.

Presentations

Distributed Training of Deep Learning Models Session

In this talk, we will present two platforms for running distributed deep learning training in the cloud. We will train a ResNet network on the ImageNet dataset using some of the most popular deep learning frameworks, compare and contrast the performance improvement as we scale the number of nodes, and offer tips and details on the pitfalls of each framework and platform.

Geordie Kaytes is a product strategist and design leader with deep experience in design process transformation. His cross-functional expertise in design, strategy, and technology has helped companies in a broad range of industries develop a 360-degree view of their product design processes.

Presentations

Measure What Matters: How your measurement strategy can reduce OpEx Tutorial

These days it’s easy for companies to say, "We measure everything!” The problem is, most “popular” metrics may not be appropriate or relevant for your business. Measurement isn’t free, and should be done strategically. This session covers how you can align measurement with your product strategy, so you can measure what matters for your business.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Correlation Analysis on Live Data Streams Session

The rate of growth of data volume and velocity has been accelerating, and the variety of data sources has been growing as well. This poses a significant challenge for extracting actionable insights in a timely fashion. This talk focuses on how marrying correlation analysis with anomaly detection can help, and discusses robust techniques to guide effective decision making.
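To give a flavour of what correlating live streams involves (a minimal sketch of one building block, not the speaker's method): on a stream you cannot buffer and rescan the data, so statistics must be maintained incrementally. A single-pass, Welford-style Pearson correlation looks like this:

```python
import math

class OnlineCorrelation:
    """Single-pass Pearson correlation over a stream of (x, y) pairs."""
    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c_xy = 0.0  # running co-moment of x and y
        self.c_xx = 0.0  # running sum of squared deviations of x
        self.c_yy = 0.0  # ... and of y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x  # deviations from the *old* means
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Welford's trick: old deviation times new deviation is numerically stable.
        self.c_xy += dx * (y - self.mean_y)
        self.c_xx += dx * (x - self.mean_x)
        self.c_yy += dy * (y - self.mean_y)

    def value(self):
        if self.c_xx == 0.0 or self.c_yy == 0.0:
            return float("nan")
        return self.c_xy / math.sqrt(self.c_xx * self.c_yy)

corr = OnlineCorrelation()
for t in range(100):
    corr.update(t, 2 * t + 5)  # perfectly linear relationship
print(corr.value())  # -> 1.0, up to float rounding
```

A production system would typically maintain this over a sliding or decaying window and across many stream pairs, which is where the scalability and robustness questions the talk addresses come in.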

Modern real-time streaming architectures Tutorial

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. In this tutorial, we present an in-depth overview of state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Ivan Kelly is a Software Engineer for Streamlio, a startup dedicated to providing a next generation integrated real-time stream processing solution, based on Heron, Apache Pulsar (incubating) and Apache BookKeeper. Ivan has been active in Apache BookKeeper since its very early days as a project in Yahoo! Research Barcelona. Specializing in replicated logging and transaction processing, he is currently focused on Streamlio’s storage layer.

Presentations

Multi-datacenter and multi-tenant durable messaging with Apache Pulsar Session

This talk introduces Apache Pulsar, a durable, distributed messaging system underpinned by Apache BookKeeper, which provides the enterprise features necessary to guarantee that your data is where it should be and accessible only by those who should have access.

  • 2016–2017.1: Software Engineer at Coupang
  • 2017.2–present: Software Engineer at Navercorp

Presentations

Analyzing 100B rows in real-time on Druid @naver.com Session

naver.com is the largest search engine in Korea, with over 70% market share, serving several billion page views per day. My team successfully built a real-time MOLAP system on Druid over 100 billion rows. In this presentation, we'll cover our goals, the problems we hit during development, our techniques for speed-up, how we extended Druid, and our overall architecture.

Eugene is a Staff Software Engineer on the Cloud Dataflow team at Google, currently working on the Apache Beam programming model and APIs. Previously he worked on Cloud Dataflow’s autoscaling and straggler elimination techniques. He is interested in programming language theory, data visualization, and machine learning.

Presentations

Radically modular data ingestion APIs in Apache Beam Session

Apache Beam offers users a novel programming model in which the classic batch/streaming dichotomy is erased, and ships with a rich set of IO connectors to popular storage systems. We describe Beam's philosophy for making these connectors flexible and modular, a key component of which is "Splittable DoFn" - a novel programming model primitive that unifies data ingestion between batch and streaming.

Olivia Klose is a software development engineer in the Technical Evangelism and Development Group at Microsoft, where she focuses on all analytics services on Microsoft Azure, in particular Hadoop (HDInsight), Spark, and Machine Learning. Olivia is a frequent speaker at conferences both in Germany and around the world, including TechEd Europe, PASS Summit, and Technical Summit. She studied computer science and mathematics at the University of Cambridge, the Technical University of Munich, and IIT Bombay, with a focus on machine learning in medical imaging.

Presentations

Detecting Small Scale Mines in Ghana Session

Computer vision is becoming one of the focus areas in artificial intelligence (AI), enabling computers to see and perceive like humans. In a collaboration with the Royal Holloway University, we applied deep learning to locate small-scale mines in Ghana using satellite imagery, scaled training using Kubernetes, and investigated the mines' impact on surrounding populations and the environment.

Kostas is a Flink committer, currently working with data Artisans to make Apache Flink® the best open source stream processing engine and your data's best friend. Before joining data Artisans, Kostas was a postdoctoral researcher at IST in Lisbon, and before that he obtained a PhD in computer science from INRIA (France). He obtained his engineering diploma from NTUA in Athens; his main research focus was on cloud storage and distributed processing.

Presentations

Complex Event Processing with Apache Flink Session

Complex event processing (CEP) helps detect patterns over continuous streams of data. DNA sequencing, fraud detection, tracking of shipments with specific characteristics (e.g., contaminated goods), and user activity analysis all fall into this category. In this talk, we will present Flink's CEP library and the benefits of its integration with Flink.

Jorie Koster-Hale is a lead scientist at Dataiku, with expertise in neuroscience, healthcare data, and machine learning. Prior to joining Dataiku, she completed her Ph.D. in Cognitive Neuroscience at Massachusetts Institute of Technology and worked as a Postdoctoral Fellow at Harvard. Jorie currently resides in Paris, where she builds predictive models and eats pain au chocolat.

Presentations

Rent, rain, and regulations: leveraging structure in big data to predict criminal activity Session

Predicting crime poses a unique technical challenge: it is affected by many different geospatial and temporal features, such as weather, infrastructure, demographics, public events, and government policy. Here, I use a combination of open source data, machine learning, time series modeling, and geostatistics to ask where crime will occur, what predicts it, and what we can do to prevent it in the future.

Aljoscha Krettek is a PMC member at Apache Flink and a cofounder and software engineer at data Artisans. He studied computer science at TU Berlin and has worked at IBM Germany and the IBM Almaden Research Center in San Jose. In Flink, Aljoscha mainly works on the streaming API.

Presentations

Stream Processing for the Practitioner: Blueprints for common stream processing use cases with Apache Flink Session

When working with Apache Flink users, we see many different types of stream processing applications being implemented on top of Apache Flink. Over time, we noticed common patterns and saw how most streaming applications can be reduced to a few core archetypes or “application blueprints”. In this talk we will outline a few of these “stream processing application blueprints”.

Sanjeev Kulkarni is the cofounder of Streamlio, a company focused on building a next-generation real-time stack. Previously, he was the technical lead for real-time analytics at Twitter, where he cocreated Twitter Heron; worked at Locomatix handling the company’s engineering stack; and led several initiatives for the AdSense team at Google. Sanjeev holds an MS in computer science from the University of Wisconsin-Madison.

Presentations

Modern real-time streaming architectures Tutorial

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. In this tutorial, we present an in-depth overview of state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Paul Lashmet is practice lead and advisor for financial services at Arcadia Data, which provides visual big data analytics software that empowers business users to glean meaningful, real-time business insights from high-volume and varied data in a timely, secure, and collaborative way. He writes extensively about practical applications of emerging and innovative technologies to regulatory compliance. Before joining Arcadia Data, Paul led programs at HSBC, Deutsche Bank, and Fannie Mae.

Presentations

Real-Time Trade Surveillance Is Not Just About Trade Data Session

To fully demonstrate that policies and procedures comply with regulatory requirements, financial services organizations must use information that goes beyond traditional data sources. This talk will demonstrate how alternative data sources enhance trade surveillance by providing a deeper understanding of the intent of trade activities.

Francesca Lazzeri is a data scientist at Microsoft, where she is part of the algorithms and data science team. Francesca is passionate about innovations in big data technologies and the applications of advanced analytics to real-world problems. Her work focuses on the deployment of machine learning algorithms and web service-based solutions to solve real business problems for customers in the energy, retail, and HR analytics sectors. Previously, she was a research fellow in business economics at Harvard Business School. She holds a PhD in innovation management.

Presentations

Operationalize Deep Learning models for Fraud Detection with Azure Machine Learning Workbench Session

Advancements in computing technologies and e-commerce platforms have amplified the risk of online fraud, and failing to prevent fraud results in billions of dollars of loss for the financial industry. This trend has prompted companies to consider AI techniques, including deep learning, for fraud detection. In this talk, we show how to operationalize deep learning models with AzureML to prevent fraud.

Mike Lee Williams is director of research at Cloudera Fast Forward Labs, an applied machine intelligence lab in New York City, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Fast Forward Labs’s clients understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Interpretable machine learning products Session

Interpretable models result in more accurate, safer, and more profitable machine learning products, but interpretability can be hard to ensure. In this talk, we'll look closely at the growing business case for interpretability and at concrete applications including churn, finance, and healthcare, and we'll demonstrate the use of LIME, an open source, model-agnostic tool you can apply to your models today.
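To make the model-agnostic idea concrete (a from-scratch toy sketch, not the actual `lime` package): a LIME-style explainer perturbs the input around the instance of interest, weights perturbed samples by proximity, and fits a simple weighted linear surrogate whose coefficients serve as local feature attributions.

```python
import math
import random

def explain_locally(predict, x, n_samples=4000, scale=0.5, seed=0):
    """Toy LIME-style local explanation for a black-box `predict(features)`.

    Returns one weighted least-squares slope per feature, estimated from
    Gaussian perturbations around x. The slopes act as local attributions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]
        dist2 = sum((a - b) ** 2 for a, b in zip(z, x))
        # Proximity kernel: samples near x dominate the surrogate fit.
        weight = math.exp(-dist2 / (2 * scale ** 2 * len(x)))
        rows.append((z, predict(z), weight))

    w_sum = sum(w for _, _, w in rows)
    ybar = sum(w * y for _, y, w in rows) / w_sum
    slopes = []
    for i in range(len(x)):
        zbar = sum(w * z[i] for z, _, w in rows) / w_sum
        cov = sum(w * (z[i] - zbar) * (y - ybar) for z, y, w in rows)
        var = sum(w * (z[i] - zbar) ** 2 for z, _, w in rows)
        slopes.append(cov / var)
    return slopes

# Hypothetical black box: locally, risk rises with feature 0, falls with feature 1.
model = lambda z: 3.0 * z[0] - 2.0 * z[1]
print(explain_locally(model, [1.0, 1.0]))  # roughly [3.0, -2.0]
```

The real LIME library adds interpretable feature representations and sparse surrogate fitting, but the perturb-weight-fit loop above is the core of the model-agnostic idea.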

My work involves a combination of consultancy and hands-on work. As a consultant, I work with business partners to develop data science solutions that make the most of their data, including in-depth analysis of existing data and predictive analytics for future business needs. My hands-on work includes programming, mentoring and managing teams of data scientists on projects in a wide variety of business domains.

Presentations

Predicting rent arrears: leveraging data science in the public sector Session

Social housing provides secure, low-cost housing options to those most in need. One major challenge to this programme is determining how best to target interventions when tenants fall behind on rent payments. We will discuss a recent project in which a team of data science trainees helped Hackney Council devise a more efficient, targeted strategy to detect and prioritise such situations.

Revolutionising the newsroom with artificial intelligence Session

In the era of 24-hour news and online newspapers, editors in the newsroom must be able to make fast decisions about their content and must quickly and efficiently make sense of the enormous amounts of data that they encounter. We will discuss an ongoing partnership between News UK and Pivigo in which a team of data science trainees helped develop an AI platform to help in this task.

I work as a lead data architect, leading the entire big data implementation lifecycle, and am the founder of ReactoData, a Hadoop and big data startup located in Buenos Aires, Argentina.

I have experience with projects ranging from major applications for big companies to small personal applications and everything in between. I started my career many years ago as a Python and DBMS developer.

Last but not least, I am working on a personal project mixing stats, tennis, and machine learning.

Presentations

Hadoop under attack. Securing data in a banking domain. Session

One reason companies look at the Hadoop ecosystem with some misgivings is the apparent difficulty of managing Hadoop compared with more traditional, proprietary data products. However, the management of Hadoop security is getting more and more attention, particularly in the Cloudera stack. We will cover how we implemented security in a banking domain and point out the challenges and improvements along the way.

Michael Li founded the Data Incubator, a New York-based training program that turns talented PhDs from academia into workplace-ready data scientists and quants. The program is free to Fellows, and routinely accepts just 1% of applicants. Employers engage with the Incubator as hiring partners.

Previously, he worked as a data scientist (Foursquare), Wall Street quant (D.E. Shaw, J.P. Morgan), and a rocket scientist (NASA). He completed his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall Scholar. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup to focus on what he really loves.

Michael lives in New York, where he enjoys the Opera, rock climbing, and attending geeky data science events. You can find out more at http://www.thedataincubator.com or @thedatainc.

Presentations

Data, AI, and Innovation in the Enterprise Session

What are the latest initiatives and use cases around Data and AI within different corporations and industries? How are Data and AI reshaping different industries? What are some of the challenges of implementing AI within the enterprise setting? We’re convening four experts from different industries to answer these questions and more!

Dr. Audrey Lobo-Pulo is a senior advisor at the Australian Treasury and an advocate for open government and the use of open source software in government modelling. Originally a physicist working in high-speed data transmission, Audrey moved into economic modelling and has experience across a wide range of policy issues, including taxation, housing, social security, labour markets, and population demographics. Audrey's current passion is bringing data science to public policy analytics. She is also an advocate for inclusion and diversity in the workplace.

Presentations

Leveraging public-private partnerships using data analytics for economic insights Session

In October 2017, LinkedIn and the Australian Treasury teamed up to gain a deeper understanding of the Australian labour market through new data insights which may inform economic policy and directly benefit society. This presentation shares some of the discoveries, together with the practicalities of working in a public-private partnership.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Hollie is an interaction design manager at Fjord, where she focuses on helping clients create products and services that excite audiences and drive engagement. Over the last ten years she has worked in the luxury, culture, publishing, and telco sectors, collaborating closely with clients to create user-centred designs for a wide range of digital products, from large-scale collections systems to small innovative app projects.

She mentors early-stage startups as part of the Google Launchpad program and junior UX designers through UXPA. When not designing, she's a keen traveller and Instagram addict.

Her client list includes Google, Facebook, Net-A-Porter, BBC, Southbank Centre, The V&A, Paul Smith, Wellcome Collection, The National Theatre, Deep Mind and Roald Dahl.

Presentations

Designing Ethical Artificial Intelligence Session

There is no doubt that artificial intelligence systems are a powerful agent of change in our society. As this technology becomes increasingly prevalent, transforming our understanding of ourselves and our society, issues around ethics and regulation arise. In this session, we will explore how we can address fairness, accountability and the long-term effects on our society when designing with data.

Humanising Data — How to Find the Why Session

Data has opened up huge possibilities for analysing and customising services. We can now manage experiences to dynamically target audiences and respond immediately; however, context is often missing. We will go through practical steps to help you find the why behind the data patterns you are seeing and decide what level of personalised service to create.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services.
Boris has 15 years’ experience in enterprise architecture and has been accountable for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies and Professional Hadoop Solutions, both from Wiley. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams Tutorial

This hands-on tutorial builds streaming apps as _microservices_ using Kafka with Akka Streams and Kafka Streams. We'll assess the strengths and weaknesses of each tool so you'll be better informed when choosing tools for your needs. We'll contrast them with Spark Streaming and Flink, including when to choose them instead.
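
To illustrate the kind of streaming logic the tutorial covers, here is a library-free Python sketch of the consume-transform-produce word-count pattern that both Akka Streams and Kafka Streams express (in Kafka Streams, as a KTable aggregation). The in-memory "topic" and data are illustrative stand-ins for real Kafka topics.

```python
# A library-free sketch of the consume-transform-produce pattern that
# streaming microservices built on Akka Streams or Kafka Streams express.
# The in-memory "topic" and the word-count logic are illustrative only.

from collections import Counter

def word_count_stream(records):
    """Consume a stream of text records and emit running word counts,
    mimicking a Kafka Streams KTable aggregation."""
    counts = Counter()
    for record in records:
        for word in record.lower().split():
            counts[word] += 1
            yield (word, counts[word])  # emit each updated (key, count) pair

input_topic = ["streaming data", "streaming apps"]
updates = list(word_count_stream(input_topic))
# updates ends with the count for "streaming" reaching 2
```

A real deployment would read from and write to Kafka topics; only the aggregation logic is shown here.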

Paul leads the Product Engineering team within Nordea Data Engineering. He is enthusiastic about how his work supports Data Engineering’s goal to deliver Nordea’s next-generation data services and data platform. Prior to joining Nordea, Paul worked for nine years at investment banks in London. Paul holds an MA in Psychology from the University of Dublin, Trinity College, and an MPhil in Social and Development Psychology from the University of Cambridge. Outside of work, his interests are supporting an educational charity, playing cello, and sailing.

Presentations

How Nordea Reduced Time-to-Market by 85% with Modern Analytics Session

As a Global Systemically Important Bank (GSIB), Nordea Bank is subject to the highest level of regulatory oversight in the financial services industry. In this session, Nordea will explain how implementing new technology and processes for acquiring, preparing and analyzing data helped reduce the time spent on the bank’s compliance reporting processes by over 85%.

Angie Ma is cofounder and COO of ASI Data Science, a London-based AI tech startup that offers data science as a service. ASI has completed more than 120 commercial data science projects in multiple industries and sectors and is regarded as the EMEA-based leader in data science. Angie is passionate about real-world applications of machine learning that generate business value for companies and organizations and has experience delivering complex projects from prototyping to implementation. A physicist by training, Angie was previously a researcher in nanotechnology working on developing optical detection for medical diagnostics.

Presentations

Data science for managers 2-Day Training

Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Gerard Maas is a senior software engineer at Lightbend, where he contributes to the Fast Data Platform, focusing on the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs.
He enjoys giving tech talks, contributing to small and large open source projects, tinkering with drones, and building personal IoT projects.
Gerard is the coauthor of Learning Spark Streaming from O’Reilly Media.

Presentations

Processing Fast Data with Apache Spark: The Tale of Two APIs Session

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. We will provide a critical view of their differences in key aspects of a streaming application: API user experience, dealing with time, dealing with state, and machine learning capabilities. We will wrap up with practical guidance on picking one or combining both to implement resilient streaming pipelines.
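
As a concrete illustration of the "dealing with time and state" theme, here is a toy, library-free sketch of event-time tumbling-window counting, the kind of stateful operation both APIs offer (e.g. `window()` in Structured Streaming). The events and window size are invented for illustration.

```python
# A toy, library-free sketch of event-time tumbling-window counting, the
# kind of stateful streaming operation the talk contrasts across Spark's
# two APIs. Timestamps (seconds) and window size are illustrative.

def tumbling_window_counts(events, window_seconds):
    """Assign each (event_time, key) pair to a tumbling window and count."""
    counts = {}
    for event_time, key in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] = counts.get((window_start, key), 0) + 1
    return counts

events = [(0, "click"), (5, "click"), (12, "view"), (14, "click")]
result = tumbling_window_counts(events, window_seconds=10)
# {(0, "click"): 2, (10, "view"): 1, (10, "click"): 1}
```

The real APIs add watermarks and late-data handling on top of this basic grouping.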

Mark Madsen is president of Third Nature, where he advises global organizations on data strategy and technology infrastructure to support data science and analytics. He has spent most of the past 25 years working in the analytics field, starting in AI at the University of Pittsburgh and robotics at Carnegie Mellon University.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This tutorial covers design assumptions, design principles, and how to approach the architecture and planning for multi-use data infrastructure in IT.

Executive Briefing: BI on Big Data Session

If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. This briefing will present the tradeoffs between different architectures to provide self-service access to data.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next generation data platform Tutorial

Using the Internet of Things and Customer 360 as an example, we’ll explain how to architect a modern, real-time data platform leveraging recent advancements in open-source software. We’ll show how components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines can enable new forms of data processing and analytics.

Big Data at Speed Session

There are a lot of details that go into building a big data system for speed: what is a respectable latency for data access, how to solve the multi-region problem, where to store the data, how to know what data you have, and where stream processing fits in. In this session, we walk through our experiences and lessons learned from seeing implementations in the wild.

I’ve been working with data for at least the last five years, across data analysis, machine learning, data engineering, and data visualization.

Currently, I’m the Data Science Lead at Typeform, helping other data scientists solve problems faster and making sure they are aligned.

Presentations

How Typeform's Data & Analytics team managed to embed its data scientists into cross-functional teams while maintaining their cohesion Session

At Typeform, the Data Team is growing into a less centralized structure, having data scientists embedded inside Product and Business teams. While changing the team’s structure, we introduced some initiatives to ensure alignment and cohesion. This session shows our journey through this challenging process, with key learnings, best practices and new processes established.

Dana Mastropole is a data scientist in residence at the Data Incubator and contributes to curriculum development and instruction. Previously, Dana taught elementary school science after completing MIT’s Kaufman teaching certificate program. She studied physics as an undergraduate student at Georgetown University and holds a master’s in physical oceanography from MIT.

Presentations

Machine learning with TensorFlow 2-Day Training

The TensorFlow library uses data flow graphs for numerical computation, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms.
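
To make the data-flow-graph idea concrete, here is a minimal, library-free sketch of a graph of computation nodes and their evaluation; TensorFlow layers tensors, gradients, and device placement on top of this basic structure. This is illustrative, not TensorFlow API code.

```python
# A minimal, library-free sketch of the data-flow-graph idea behind
# TensorFlow: computations are nodes whose inputs are other nodes, and
# evaluation walks the graph. TensorFlow adds tensors, automatic
# differentiation, and CPU/GPU placement on top of this structure.

class Node:
    def __init__(self, op, *inputs):
        self.op = op          # a function combining the input values
        self.inputs = inputs  # upstream nodes

    def eval(self):
        return self.op(*(n.eval() for n in self.inputs))

def const(value):
    """A leaf node that always produces the same value."""
    return Node(lambda: value)

# Build the graph (x * y) + z, then evaluate it.
x, y, z = const(2.0), const(3.0), const(4.0)
graph = Node(lambda a, b: a + b, Node(lambda a, b: a * b, x, y), z)
result = graph.eval()  # 10.0
```

Because the graph is built before it runs, a framework can analyze it and schedule independent branches in parallel.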

Jaya Mathew is a senior data scientist at Microsoft, where she is part of the Artificial Intelligence and Research team. Her work focuses on the deployment of AI and ML solutions to solve real business problems for customers in multiple domains. Prior to joining Microsoft, she worked with Nokia and Hewlett-Packard on various analytics and machine learning use cases. She holds undergraduate and graduate degrees in mathematics and statistics from the University of Texas at Austin.

Presentations

Operationalize Deep Learning models for Fraud Detection with Azure Machine Learning Workbench Session

Advancements in computing technologies and e-commerce platforms have amplified the risk of online fraud. Failing to prevent fraud results in billions of dollars of loss for the financial industry. This trend has urged companies to consider AI techniques, including deep learning, for fraud detection. In this talk we show how to operationalize deep learning models with AzureML to prevent fraud.

Originally from an IT background, Jane specialised in Oil & Gas, with its specific information management needs, back in 2000, and has been developing products, implementing solutions, consulting, blogging, and presenting in this area ever since.

Jane has worked for the dominant market players, Landmark and Schlumberger, in R&D, product management, consulting, and sales, before joining Teradata in 2012. In one role or another, she has influenced information management projects for most major oil companies across Europe. She chaired the Education Committee for the European oil industry data management group ECIM, has written for Forbes, and regularly presents internationally at oil industry events.

As Practice Partner for Oil and Gas within Teradata’s Industrial IoT group, Jane is focused on working with Oil and Gas clients across the international region to show how analytics can provide strategic advantage, and business benefits in the multi-millions. Jane is also a member of Teradata’s IoT core team, setting the strategy and positioning for Teradata’s IoT offerings, and works closely with Teradata Labs to influence development of products and services for the Industrial space.

Jane holds a B. Eng. in Information Systems Engineering from Heriot-Watt University, UK. She is Scottish, and has a stereotypical love of single malt whisky.

Presentations

Driving better predictions in the Oil and Gas Industry with modern data architecture Session

Oil exploration and production is technically challenging, and exploiting the associated data (petabyte seismic surveys, real-time sensor data, 3D models of the subsurface) brings its own challenges too. Oil companies are now looking to big data analytics to help them make better predictions, but how do we marry the specialist functionality of the old siloed systems with a new big data world?

Jude Mc Corry is Director of Business Development at the Data Lab in Scotland, where she is responsible for delivering collaborative data science projects between industry and academia while delivering value and creating jobs. She has also worked with the public sector on data-for-good projects such as delayed discharge and safe homes, and sees huge potential in collaborating with data to achieve even bigger and better things.
Jude has over 15 years’ experience in sales and marketing in the technology sector, working with B2B and public sector clients for companies like Dell, Firefly Communications, and Xnet Data Storage. She has also worked in academia at Edinburgh Napier University, where she set up its commercial arm, the Edinburgh Institute, providing leading-edge, practice-based executive education to Scottish executives.

Presentations

Data Collaboratives Session

Data Collaboratives are a new form of collaboration, beyond the public-private partnership model, in which participants from different sectors exchange data, skills, leadership, and knowledge to solve complex problems facing children in Scotland and worldwide.

Shaun McGirr is the Lead Data Scientist at Cox Automotive Data Solutions and has been working with data in one way or another for about 15 years. A recent PhD graduate from the University of Michigan, he spends most days developing new data products for the automotive industry.

Presentations

Scaling Data Science (Teams and Technologies) Session

Cox Automotive is the world’s largest automotive service organisation, and that means we can combine data from across the entire vehicle lifecycle. We are on a journey to turn this data into insights and want to share some of our experiences both in building up a data science team and scaling the data science process (from laptop to Hadoop cluster).

Born in Sardinia, Italy.

Passionate about applying data analytics and problem solving to real-world problems. Mathematical Engineering and Statistics background (Politecnico di Milano).

Lover of food, nature and travelling.

Currently working as a Data Scientist at Typeform.com, in sunny Barcelona (Spain).

Previously lived in Italy, USA, Brazil, and Chile.

Presentations

How Typeform's Data & Analytics team managed to embed its data scientists into cross-functional teams while maintaining their cohesion Session

At Typeform, the Data Team is growing into a less centralized structure, having data scientists embedded inside Product and Business teams. While changing the team’s structure, we introduced some initiatives to ensure alignment and cohesion. This session shows our journey through this challenging process, with key learnings, best practices and new processes established.

Miriah Meyer is an associate professor in the School of Computing at the University of Utah, where she runs the Visualization Design Lab. Her research focuses on the design of visualization systems for helping analysts and researchers make sense of complex data. Miriah was named a University of Utah distinguished alumni, a TED fellow, and a PopTech science fellow and was included on MIT Technology Review’s TR35 list of the top young innovators.

Presentations

Making Data Visual: A Practical Session on Using Visualization for Insight Tutorial

How do you derive insight from data? This session teaches the human side of data analysis and visualization. We'll discuss operationalization, the process of reducing vague problems to specific tasks, and how to choose a visual representation that addresses those tasks. We'll also discuss single views, and how to link them into multiple views.

Grigorios has more than two years’ experience as a data scientist at easyJet. He has applied machine learning and statistical techniques to tackle a wide variety of business problems and drive profits and savings within the company. He enjoys working with customers and contributing to the evolution and growth of his team. He holds a PhD in electronic and electrical engineering from Imperial College and has an interdisciplinary background in Bayesian modelling and parallel computing.

Presentations

Data science survival and growth within the corporate jungle: An easyJet case study Session

In-house data science teams often work with a range of business functions, apply diverse techniques and face unpredictable hurdles related to requirements, data, infrastructure and deployment. Traditional data science processes are too abstract to cope with the complexity of these environments. This session will use recent project examples at easyJet to highlight how we overcame these challenges.

Fausto Morales is a data scientist at Arundo Analytics, where he works on product development and customer projects. Prior to Arundo, he worked at ExxonMobil on projects that included environmental remediation, product pricing, and water treatment process modeling. He graduated from MIT in 2012 with a bachelor’s degree in civil engineering.

Presentations

Real-Time Motorcycle Racing Optimization Session

In motorcycle racing, riders make snap decisions that determine outcomes ranging from success to grievous injury. Using a custom software-based edge agent and machine learning, we automate real-time maneuvering decisions in order to limit tire slip during races, thereby mitigating risk and enhancing competitive advantage.

Calum Murray is the chief data architect in the Small Business group at Intuit. Calum has 20 years’ experience in software development, primarily in the finance and small business spaces. Over his career, he has worked with various languages, technologies, and topologies to deliver everything from real-time payments platforms to business intelligence platforms.

Presentations

Machine learning at Intuit: 5 delightful use-cases Session

Machine learning based applications are becoming the new norm. Intuit is using the data of over 60 million users to create delightful experiences for customers by solving repetitive tasks, freeing them up to spend time more productively or solving very complex tasks with simplicity and elegance. This talk looks at 5 use cases at Intuit.

Advanced statistics competence in the field of business and economics. Focused on applying data exploration, building machine learning models to find hidden patterns inside the data, and interpreting them.

Presentations

Real-time monitoring discovering: a case study in automotive industry Session

Do you want to know whether the products you make and sell are used efficiently? What they are doing and when they will break down? Real-time monitoring will give you the answers. Through a real case study in the automotive industry, this session will show some best practices in managing and analysing telematic data in order to discover all the achievable benefits.

Mikheil Nadareishvili heads a data science unit at TBC Bank, which is tasked with developing a centralized data repository and client analytics (such as segmentation and behavior models) and with promoting an evidence-based decision-making culture in the organization.

Prior to TBC, Mikheil worked on applying data science to various domains, most notably to real estate (to determine housing market trends and predict real estate prices), and education (to determine factors that influence students’ educational attainment in Georgia).

Presentations

Data-driven journey to Customer-centric Banking Session

TBC Bank has undergone an analytic shift over the last three years as part of a larger shift from a product-centric approach to a client-centric one. Mainly Excel-based analytics, siloed in business units with low reproducibility and a limited customer view, has been replaced by an integrated 360° view of the customer and advanced analytics, enabling personalized service.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Setting Up a Lightweight Distributed Caching Layer using Apache Arrow Session

This talk will deep-dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. We'll start with an overview of the system design and deployment architecture, including cache lifecycle, update patterns, cache cohesion, and appropriate use cases.

Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

Human-in-the-loop: a design pattern for managing teams which leverage ML Session

Human-in-the-loop has been used for simulation, training, UX mockups, and more. A variant of semi-supervised learning called _active learning_ has emerged to leverage HITL as a design pattern for prioritizing how people and machines work together. This talk reviews case studies and management perspectives, plus related open source and commercial products, with a focus on use cases at O'Reilly Media.
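
As a minimal illustration of the active learning pattern mentioned above, the sketch below shows uncertainty sampling: the model's least-confident predictions are routed to a human for labeling. The documents, scores, and budget are hypothetical, not from the talk.

```python
# A toy sketch of uncertainty sampling, a common active learning
# strategy: the model scores unlabeled items, and the ones it is least
# confident about are sent to a human annotator. The probabilities
# below are hypothetical model outputs.

def select_for_human_review(predictions, budget):
    """Return the items whose predicted probability is closest to 0.5,
    i.e. where the model is least certain."""
    by_uncertainty = sorted(predictions, key=lambda kv: abs(kv[1] - 0.5))
    return [item for item, _ in by_uncertainty[:budget]]

predictions = [("doc-a", 0.97), ("doc-b", 0.52), ("doc-c", 0.10), ("doc-d", 0.45)]
to_label = select_for_human_review(predictions, budget=2)
# ["doc-b", "doc-d"] -- the two most ambiguous documents
```

The human labels then feed back into training, so annotation effort goes where the model benefits most.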

Allison is a highly energetic executive and experienced intrapreneur with a track record of driving strategic growth and transformation through data and insight. At Cox Automotive, Allison has led the launch of Cox Automotive Data Solutions, helping internal and external clients use data to improve decision-making. Previously, Allison developed the entire product portfolio for LexisNexis’s new UK venture, leading to double-digit year-on-year growth while transforming the motor insurance industry. A trained quantitative political scientist, Allison holds a BA in mathematics and international relations from the College of Wooster and an MA in political science from the University of Michigan.

Presentations

Extracting value from data: How Cox Automotive is using data to drive growth and transform the way the world buys, sells, and owns cars. Session

Two and a half years into its data journey, Cox Automotive has realised significant benefits by harnessing the power of data, both internally and in the development of data solutions to improve decision-making within automotive. This session provides a case study of that transformation, including how to mobilise data for production and how to drive the cultural changes needed to become a data-driven organisation.

Kim Nilsson is the CEO of Pivigo, a London-based data science marketplace and training provider responsible for S2DS, Europe’s largest data science training program, which has by now trained more than 430 fellows working on over 120 commercial projects with 80+ partner companies, including Barclays, KPMG, Royal Mail, News UK, and Marks & Spencer. An ex-astronomer turned entrepreneur with a PhD in astrophysics and an MBA, Kim is passionate about people, data, and connecting the two.

Presentations

Successful Data Cultures: Inclusivity, empathy, retention, and results Session

Our lives are being transformed by data. Our work, our play, and our health are now understood in new ways, and every organisation can take advantage of this resource. But something is holding us back: us! This talk discusses how to build a successful data culture, how to embed data at the heart of every organization through people, and how empathy, communication, and humanity deliver success.

Michael Noll is a product manager at Confluent, the company founded by the creators of Apache Kafka. Previously, Michael was the technical lead of DNS operator Verisign’s big data platform, where he grew the Hadoop, Kafka, and Storm-based infrastructure from zero to petabyte-sized production clusters spanning multiple data centers—one of the largest big data infrastructures in Europe at the time. He is a well-known tech blogger in the big data community. In his spare time, Michael serves as a technical reviewer for publishers such as Manning and is a frequent speaker at international conferences, including Strata, ApacheCon, and ACM SIGIR. Michael holds a PhD in computer science.

Presentations

Unlocking the world of stream processing with KSQL, the streaming SQL engine for Apache Kafka Session

We introduce KSQL, the open source streaming SQL engine for Apache Kafka. KSQL makes it easy to get started with a wide range of real-time use cases such as monitoring application behavior and infrastructure, detecting anomalies and fraudulent activities in data feeds, and real-time ETL. We cover how to get up and running with KSQL and also explore the under-the-hood details of how it all works.
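
As a taste of what such continuous queries look like, the sketch below pairs a representative KSQL statement (in a comment) with a library-free Python illustration of the filtering it performs; the topic, fields, and threshold are hypothetical examples, not from the session.

```python
# KSQL expresses continuous queries over Kafka topics in SQL, e.g.:
#   CREATE STREAM suspicious AS SELECT * FROM payments WHERE amount > 1000;
# The generator below is a library-free illustration of what such a
# continuous filter does; the topic and threshold are hypothetical.

def suspicious_payments(payments, threshold=1000):
    """Continuously emit payments above the threshold, as the filtered
    KSQL stream above would."""
    for payment in payments:
        if payment["amount"] > threshold:
            yield payment

stream = [{"id": 1, "amount": 250}, {"id": 2, "amount": 5400}, {"id": 3, "amount": 90}]
flagged = list(suspicious_payments(stream))
# [{"id": 2, "amount": 5400}]
```

In KSQL the query runs indefinitely against the live topic rather than over a finite list.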

Michael Nolting is a data scientist for Volkswagen commercial vehicles. Michael has worked in a variety of research fields at Volkswagen AG, including adapting big data technologies and machine learning algorithms to the automotive context. Previously, he was head of a big data analytics team at Sevenval Technologies. Michael holds a Dipl.-Ing. degree in electrical engineering and an MSc degree in computer science, both from the Technical University of Brunswick in Germany, and a PhD in computer science.

Presentations

Elastic Map Matching Using Cloudera Altus and Apache Spark Session

Map matching applications exist in almost every telematics use case and are therefore crucial to all car manufacturers. This talk details the architecture behind Volkswagen Commercial Vehicles’ Altus-based map matching application and closes with a live demo featuring the map matching job in Altus.

Nick is LinkedIn’s Director of Public Policy and Government Affairs for the Asia Pacific region, leading the company’s efforts to build productive partnerships with governments, decision makers, and policy influencers throughout the region.

His role includes policy and political outreach; government-focused data-sharing projects; work on technology policy issues; and development of workforce and education policy solutions that are at the core of LinkedIn’s corporate mission and its overarching vision of creating economic opportunity for every member of the global workforce.

Nick worked as legal counsel for Seven West Media before making the transition into a government relations and corporate affairs role. He left the television industry in 2012, joining Yahoo as its head of public policy in Asia, before moving to LinkedIn in 2015.

He holds a combined Bachelor of Media and Bachelor of Laws and a Master’s in Media, Information Technology and Communications Law.

Nick has served as the Chair and Treasurer of the Asia Internet Coalition, leading joint-industry advocacy across Asia and locally as a committee member of the Communications and Media Law Association.

Presentations

Leveraging public-private partnerships using data analytics for economic insights Session

In October 2017, LinkedIn and the Australian Treasury teamed up to gain a deeper understanding of the Australian labour market through new data insights which may inform economic policy and directly benefit society. This presentation shares some of the discoveries, together with the practicalities of working in a public-private partnership.

Brian O’Neill is a product designer and founder of the consulting firm Designing for Analytics, which helps companies design indispensable data and analytics software products that customers love. His clients and past employers include DELL/EMC, NetApp, Tripadvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JP Morgan Chase, the Future of Music Coalition, and ETrade, among others, and he has worked on award-winning storage industry software for Akorri and Infinio. Around 2010, Brian cofounded the adventure travel company TravelDragon.com, and he has invested in several Boston-area startups as well.

If you’re at a conference, he’ll probably be the only person walking around with an orange leather messenger bag.

Presentations

Design for Non-Designers: Tips for Creating Indispensable Data Products Session

Do you spend a lot of time explaining your data analytics product to your customers? Is your UI/UX or navigation overly complex? Are sales suffering due to complexity, or worse, are customers not using your product? Your design may be the problem. Brian O'Neill shares a secret: you don't have to be a trained designer to recognize design and UX problems and start correcting them today.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it Session

Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands), and outlines proven ways to meet them.

Francois Orsini is the chief technology officer for MZ’s Satori business unit. Previously, he served as vice president of platform engineering and chief architect, building server-side architecture and implementation for a next-gen social and server platform; was a database architect and evangelist at Sun Microsystems; and worked on OLTP database systems, middleware, and real-time infrastructure at companies like Oracle, Sybase, and Cloudscape. Francois has extensive experience with database and infrastructure development, honing his expertise in distributed data management systems, scalability, security, resource management, HA cluster solutions, soft real-time systems, and connectivity services. He also collaborated with Visa International and Visa USA to implement the first Visa Cash virtual ATM for the internet and founded a VC-backed startup called Unikala in 1999. Francois holds a bachelor’s degree in civil engineering and computer sciences from the Paris Institute of Technology.

Presentations

Correlation Analysis on Live Data Streams Session

The rate of growth of data volume and velocity has been accelerating, and the variety of data sources has been growing as well. This poses a significant challenge in extracting actionable insights in a timely fashion. The talk focuses on how marrying correlation analysis with anomaly detection can help to this end, and discusses robust techniques to guide effective decision making.
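
To sketch the core idea of combining correlation with anomaly detection, the library-free example below flags a spike in one metric only when a correlated metric does not spike with it. The series, metric names, and threshold are invented for illustration and are not from the talk.

```python
# A library-free sketch of pairing anomaly detection with correlation
# across metrics: a spike in `errors` is only flagged when it is NOT
# matched by a spike in `traffic` (i.e., when the correlated series
# disagree). All data and the z-score threshold are illustrative.

from statistics import mean, stdev

def zscores(series):
    m, s = mean(series), stdev(series)
    return [(x - m) / s for x in series]

def unexplained_anomalies(errors, traffic, threshold=2.0):
    """Indices where errors spike but traffic does not."""
    ez, tz = zscores(errors), zscores(traffic)
    return [i for i, (e, t) in enumerate(zip(ez, tz))
            if e > threshold and t <= threshold]

traffic = [100, 102, 98, 101, 99, 100, 103, 100]
errors = [5, 5, 6, 5, 5, 40, 5, 5]  # spike at index 5 while traffic is flat
anomalies = unexplained_anomalies(errors, traffic)  # [5]
```

Production systems replace the z-score with more robust detectors, but the cross-metric check is the same.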

Richard Ott is a data scientist in residence at the Data Incubator, where he gets to combine his interest in data with his love of teaching. Previously, he was a data scientist and software engineer at Verizon. Rich holds a PhD in particle physics from the Massachusetts Institute of Technology, which he followed with postdoctoral research at the University of California, Davis.

Presentations

Hands-on data science with Python 2-Day Training

The Data Incubator offers a foundation in building intelligent business applications using machine learning. We will walk through all the steps, from prototyping to production, of developing a machine learning pipeline: data cleaning, feature engineering, model building and evaluation, and deployment. Students will extend these models into two applications using real-world datasets.

Master’s degree in mathematical engineering with a statistics specialization from Politecnico di Torino. Junior data scientist at Data Reply.

Presentations

Real-time monitoring discovering: a case study in automotive industry Session

Do you want to know whether the products you make and sell are used efficiently? What they are doing and when they will break down? Real-time monitoring will give you the answers. Through a real case study in the automotive industry, this session will show some best practices in managing and analysing telematic data in order to discover all the achievable benefits.

Kevin Parent is the CEO of Conduce, a company that helps leaders and teams see and interact with all their data instantly using a single, intuitive human interface. An innovator, Kevin’s entire career has focused on connecting the dots between advances in technology and human experiences. Previously, he cofounded Oblong Industries, where he invented new–to–the-world interfaces that allow users to interact with software using displays, gestures, wands, tablets, and smartphones, and spent 10 years engineering theme park attractions. (He was a project engineer for the Twilight Zone Tower of Terror at Walt Disney Imagineering.) Kevin is the author of six patents. He holds a degree in physics from the Massachusetts Institute of Technology, where his undergraduate thesis work was conducted in MIT’s Media Lab.

Presentations

IoT improves center of activity: DHL increases efficiency and reduces distance traveled across the warehouse Session

As an essential component of DHL’s IoT initiative, Conduce visualizations track and analyze distance traveled by personnel and warehouse equipment, all calibrated around a center of activity. Using immersive operational data visualization, DHL is gaining unprecedented insight to evaluate and act on everything that occurs in its warehouses.

Neejole Patel is a sophomore at Virginia Tech, where she is pursuing a BS in computer science with a focus on machine learning, data science, and artificial intelligence. In her free time, Neejole completes independent big data projects, including one that tests the Broken Windows theory using DC crime data. She recently completed an internship at a major home improvement retailer.

Presentations

Learning PyTorch by building a recommender system Tutorial

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers thanks to its dynamic computation graphs. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.
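The tutorial itself uses PyTorch; as a framework-agnostic sketch of the kind of model a content recommender often starts from (matrix factorization trained by stochastic gradient descent, shown here on a hypothetical toy ratings list rather than the presenters' actual code):

```python
import random

# Hypothetical toy data: (user, item, rating) triples, not a real dataset.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, dim = 3, 3, 2

random.seed(0)
# Small random embeddings for every user and item.
U = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
V = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]

def predict(u, i):
    # Predicted rating = dot product of user and item embeddings.
    return sum(U[u][k] * V[i][k] for k in range(dim))

lr = 0.05
for epoch in range(200):
    for u, i, r in ratings:
        err = predict(u, i) - r  # derivative of 0.5 * err**2 w.r.t. the score
        for k in range(dim):
            gu, gv = err * V[i][k], err * U[u][k]
            U[u][k] -= lr * gu  # SGD step on the user embedding
            V[i][k] -= lr * gv  # SGD step on the item embedding

loss = sum((predict(u, i) - r) ** 2 for u, i, r in ratings)
print(loss)  # small after training; predictions approximate the ratings
```

In PyTorch the two embedding tables and the SGD loop would be replaced by `nn.Embedding` layers and an optimizer, with gradients derived automatically.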

Joshua Patterson is the director of applied solutions engineering at NVIDIA. Previously, Josh worked with leading experts across the public and private sectors and academia to build a next-generation cyberdefense platform. He was also a White House Presidential Innovation Fellow. His current passions are graph analytics, machine learning, and GPU data acceleration. Josh also loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina’s Moore School of Business.

Presentations

GPU Accelerated Threat Detection with GOAI Session

Learn how we used GPU-accelerated open source technologies to improve our cyber defense platforms at NVIDIA. Leveraging software from the GPU Open Analytics Initiative (GOAI), we improved the performance and scale of our threat-hunting activities. We will discuss how we accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration.
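The session's pipeline runs on GPU dataframes and far richer models; as a deliberately simplified, CPU-only illustration of the basic idea behind score-based anomaly flagging (a z-score over a hypothetical traffic window, not the presenters' method):

```python
import math

# Hypothetical feature: bytes transferred per host in one time window.
traffic = [1200, 1100, 1250, 1180, 1220, 9800, 1150]

mean = sum(traffic) / len(traffic)
var = sum((x - mean) ** 2 for x in traffic) / len(traffic)
std = math.sqrt(var)

# Flag hosts whose traffic deviates strongly from the window's baseline.
anomalies = [x for x in traffic if abs(x - mean) / std > 2.0]
print(anomalies)
```

A real threat-hunting pipeline would compute many such features per entity and replace the fixed threshold with a learned model; the point is only that anomaly detection reduces to scoring observations against a baseline.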

Nick Pentreath is a principal engineer in IBM’s Cognitive Open Technology Group, where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Deep Learning for Recommender Systems Session

In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. Nick Pentreath explores recent advances in this area in both research and practice.

Thomas Phelan is cofounder and chief architect of BlueData. Prior to BlueData, Tom was an early employee at VMware and as senior staff engineer was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular “pluggable storage architecture.” He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit file system.

Presentations

Deep Learning with TensorFlow and Spark using GPUs and Docker Containers Session

In the past, advanced machine learning techniques were only possible with a high-end proprietary stack. Today, you can use open source machine learning and deep learning algorithms with distributed computing technologies like Apache Spark and GPUs. This session focuses on how to deploy TensorFlow and Spark with the NVIDIA CUDA stack on Docker containers in a multi-tenant environment.

How to Protect Big Data in a Containerized Environment Session

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage, issues that are only compounded when running on Docker containers. This session discusses these challenges and how to overcome them.

Aurélie Pols designs data privacy best practices: documenting data flows in order to limit privacy backlashes and minimising the risk of ever-increasing data uses while solving for data quality. The most accurate label today would probably be "privacy engineer."
She spent the past 15 years optimising (digital) data-based decision-making processes, which allowed her to cofound, and successfully sell, her first startup in Belgium to UK agency Digitas LBi (Publicis). She is used to following the money to optimise data trails; now she follows the data to minimise growing compliance and privacy risks while touching upon security best practices and ethical data uses. Her mantra is "Data is the new electricity, privacy is the new green, trust is the new currency."
Aurélie has spoken at events including SXSW, Strata + Hadoop World, the IAPP's Data Protection Congress, Webit, and eMetrics summits, and has written several white papers on data privacy and privacy engineering best practices. Her experience and network have allowed her to discuss growing data setups and their requirements, as well as their risk, compliance, and ethical angles, in Europe, the US, and Asia.
She leads her own consultancy, with data privacy projects all around the world, is part of the European Data Protection Supervisor's (EDPS) Ethics Advisory Group (EAG), and served as data governance and privacy advocate for leading data management platform (DMP) Krux Digital Inc. prior to its acquisition by Salesforce. She teaches privacy and ethics at IE Business School in Madrid and supports DPO training courses for the Solvay Business School in Brussels as well as the faculty of law at Maastricht University. As a volunteer, she cochairs the IEEE's P7002 Data Privacy Process standard initiative while serving as a training advisor to the International Association of Privacy Professionals (IAPP).

Presentations

General Data Protection Regulation - GDPR - Tutorial (+ ePrivacy introduction) Tutorial

Using a 5+5 Pillars framework for GDPR readiness, this tutorial walks attendees through what the GDPR means for data-fueled businesses. Anchored in the accountability principle, this interactive session shows how to attribute responsibility for assuring compliance and, ideally, build toward ethical data practices, minimizing risk for your company while fostering trust with your clients.

Stuart loves storage (208 PB at Criteo) and is part of Criteo’s Lake team, which runs some small and two rather large Hadoop clusters. He also loves automation with Chef, because configuring more than 3,000 Hadoop nodes by hand is just too slow. Before discovering Hadoop, he developed user interfaces and databases for biotech companies.

Stuart has presented at ACM CHI 2000, Devoxx 2016, NABD 2016, Hadoop Summit Tokyo 2016, Apache Big Data Europe 2016, Big Data Tech Warsaw 2017, and Apache Big Data North America 2017.

Presentations

The Cloud is Expensive so Build Your Own Redundant Hadoop Clusters Session

Criteo has a production cluster of 2,000 nodes running over 300,000 jobs per day and a backup cluster of 1,200 nodes. These clusters live in our own data centres, as the cloud is more expensive, and were meant to provide a redundant solution to Criteo's storage and compute needs. We will explain our project, what went wrong, and our progress in building another cluster to survive the loss of a full data centre.

Phillip Radley is chief data architect on BT’s core Enterprise Architecture team, where he is responsible for data architecture across BT Group Plc. Based at BT’s Adastral Park campus in the UK, Phill currently leads BT’s MDM and big data initiatives, driving associated strategic architecture and investment roadmaps for the business. Phill has worked in IT and the communications industry for 30 years, mostly with British Telecommunications Plc., and his previous roles in BT include nine years as chief architect for infrastructure performance-management solutions from UK consumer broadband to outsourced Fortune 500 networks and high-performance trading networks. He has broad global experience, including with BT’s Concert global venture in the US and five years as an Asia Pacific BSS/OSS architect based in Sydney. Phill is a physics graduate with an MBA.

Presentations

How BT delivers better Broadband & TV using Spark & Kafka Session

In the past 12 months, British Telecom has added a streaming network analytics use case to its multi-tenant data platform. This presentation shows how the solution works and how it is used to deliver better broadband and TV services, explaining how Kafka, Spark on YARN, and HDFS encryption have been used to transform a mature Hadoop platform.

Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over his 20-year career, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently, product management, to provide a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Analytics in the Cloud - Building a Modern Cloud-based Big Data Warehouse Session

For many organizations, the cloud will likely be the destination of their next big data warehouse. The speakers will discuss considerations when evaluating the cloud for analytics and big data warehousing in order to steer attendees down the path of success allowing them to get the most from the cloud. Attendees will leave with an understanding of different architectural approaches and impacts.

Mala Ramakrishnan heads product initiatives for data engineering at Cloudera. She has 17+ years of experience in marketing, product management, and software development in organizations of varied sizes delivering middleware, software security, network optimization, and mobile computing. She holds a master's in computer science from Stanford University. Outside of work, she is a mom of two boys, 6 and 9 years of age.

Presentations

Running Data Analytic Workloads on the Cloud Tutorial

Cloud solutions offer an alternative to single, multipurpose clusters, with hyperscale storage decoupled from elastic, on-demand compute. Join us to discuss new paradigms for effectively running production-level pipelines with minimal operational overhead, removing barriers to data discovery, metadata sharing, and access control.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Modern real-time streaming architectures Tutorial

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. In this tutorial, we present an in-depth overview of state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

I have been working on the Hive team at Qubole for the past two years; I graduated from BITS Pilani in 2015.

Presentations

Autonomous ETL with Materialized Views Session

This session presents a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views built on top of data, for optimal performance and strict correctness.
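SQL-on-Hadoop engines are the session's target, but the lifecycle the abstract names (create, use, invalidate, refresh) can be sketched with Python's stdlib sqlite3; the table and column names here are illustrative, not part of the framework described:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# The view definition the engine would maintain automatically.
VIEW_SQL = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
state = {"base_version": 0, "view_version": -1}  # -1 means never materialized

def write_sales(rows):
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    state["base_version"] += 1  # any base-table write invalidates the view

def refresh():
    # Materialize the query's results into a plain table.
    conn.execute("DROP TABLE IF EXISTS mv_sales_by_region")
    conn.execute("CREATE TABLE mv_sales_by_region AS " + VIEW_SQL)
    state["view_version"] = state["base_version"]

def query_totals():
    # Strict correctness: never serve stale results; refresh first if needed.
    if state["view_version"] != state["base_version"]:
        refresh()
    return dict(conn.execute("SELECT region, total FROM mv_sales_by_region"))

write_sales([("EU", 10.0), ("EU", 5.0), ("US", 7.0)])
assert query_totals() == {"EU": 15.0, "US": 7.0}  # materialized on first use
write_sales([("US", 3.0)])
print(query_totals())  # stale view detected and transparently rebuilt
```

A production engine additionally rewrites user queries to hit the view and suggests which views are worth building; the version check above is the minimal form of its invalidation logic.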

Delip Rao is the founder of R7 Speech Science, a San Francisco-based company focused on building innovative products on spoken conversations. Previously, Delip was the founder of Joostware, which specialized in consulting and building IP in natural language processing and deep learning. Delip is a well-cited researcher in natural language processing and machine learning and has worked at Google Research, Twitter, and Amazon (Echo) on various NLP problems. He is interested in building cost-effective, state-of-the-art AI solutions that scale well. Delip has an upcoming book on NLP and deep learning from O’Reilly.

Presentations

Machine learning with PyTorch 2-Day Training

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Sol currently holds three patents in the information management space and is a keynote speaker at technology conferences on topics such as machine learning, cognitive computing, data and analytics, and emerging operating models. Prior to joining Royal Caribbean, Sol was a partner at Ernst & Young in the Data & Analytics practice, and before that she was part of the first generation of leaders taking Watson to market at IBM. Goal oriented and a team player, Sol believes in bringing together people, processes, and technology to cultivate environments that are innovative, driven, and collaborative. A thought leader in the data, robotics, AI, and IT space, Sol solves complex business, IT, and operational problems and helps companies with large transformational programs. She has a unique ability to bridge the gap between business and IT; her deep understanding of multiple functional disciplines (change management, enterprise data, application architecture, process re-engineering, sales, etc.) enables her to drive change by articulating the need for it in organizations that otherwise wouldn't evolve with the emerging marketplace. Sol played NCAA water polo and rugby for Cal, played on the Women's National Rugby Team for several years, and completed an Ironman. She and her husband are avid wine drinkers and travelers and are learning to brew Belgian beer.

Presentations

Data, AI, and Innovation in the Enterprise Session

What are the latest initiatives and use cases around Data and AI within different corporations and industries? How are Data and AI reshaping different industries? What are some of the challenges of implementing AI within the enterprise setting? We’re convening four experts from different industries to answer these questions and more!

Erin is a software engineer and scrum master for the Infrastructure and Performance team at Zoomdata. Prior to Zoomdata, she worked as a full stack engineer at Appian and studied Computer Science and Mathematics at the University of Virginia. She moved to Zoomdata to fuel her learning in the big data space. Most recently, she and her team re-architected Zoomdata’s streaming capabilities.

Presentations

You’re Doing It Wrong: How Zoomdata Re-Architected Streaming Session

Combining real-time streaming analytics with historical data is crucial. Zoomdata is a big data application that updates historical dashboards in real time without complex re-aggregations, but streaming in the age of IoT requires handling data in volumes not seen in traditional feeds. This talk shows how Zoomdata moved to a scalable microservice architecture for streaming sources.

Founder of RaRe Technologies Ltd, a leading R&D consulting company in machine learning and natural language processing. Creator of Gensim, a popular open source Python library for topic modelling and information retrieval. Building practical solutions for businesses for over a decade. @radimrehurek

Presentations

Lessons from Open Source: Bridging the Gap Between Academia and Industry Session

Eight years of building complex machine learning systems for clients like Amazon, Hearst, and Autodesk. Seven years of maintaining Gensim, a popular open source package for NLP. Three years of running the RARE Incubator program to mentor talented students. What are the lessons learned? Radim shares tips for successful R&D in applied data science.

Alberto Rey is head of data science at easyJet, where he leads easyJet’s efforts to adopt advanced analytics within different areas of the business. Alberto’s background is in air transport and economics, and he has more than 15 years’ experience in the air travel industry. Alberto started his career in advanced analytics as a member of the Pricing and Revenue Management team at easyJet, working on the development of one of the most advanced pricing engines in the industry, where his team pioneered the implementation of machine learning techniques to drive pricing. He holds an MSc in data mining and an MBA from Cranfield University.

Presentations

Data science survival and growth within the corporate jungle: An easyJet case study Session

In-house data science teams often work with a range of business functions, apply diverse techniques and face unpredictable hurdles related to requirements, data, infrastructure and deployment. Traditional data science processes are too abstract to cope with the complexity of these environments. This session will use recent project examples at easyJet to highlight how we overcame these challenges.

Nikki Rouda is the cloud and core platform director at Cloudera. Nik has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their IT challenges. His career spans big data, analytics, machine learning, AI, storage, networking, security, and the IoT. Nik holds an MBA from Cambridge and an ScB in geophysics and math from Brown.

Presentations

Security, governance, and cloud analytics, oh my! Session

It's a dream that so many cloud-based analytics services are available; it's also a lurking nightmare to manage proper security and governance across all those different services. We will offer practical advice on how to minimize the risk and effort in protecting and managing data for multidisciplinary analytics, and we'll show you how to avoid the hassle and extra cost of siloed approaches.

Dr Christopher Royles is a Systems Engineer at Cloudera.

He holds a PhD in Artificial Intelligence from Liverpool University which he subsequently applied to voice recognition and voice dialogue systems. Chris has advised on Government Open Data initiatives as part of the Open Data User Group (ODUG) and sat on the quick wins stream of the UK Government Cloud Program (GCloud).

Chris has built out large-scale data lakes on Amazon and Azure and has assisted customers from their initial MVP through to full-scale production.

http://www.linkedin.com/in/chrisroyles

Presentations

Practical advice for driving down the cost of cloud big data platforms Session

Big data and cloud deployments return huge benefits in flexibility and economics, but they can also result in runaway costs and failed projects. Based on practical production experience, this session will help with your initial sizing and strategic planning through to longer-term operation, focusing on delivering an efficient platform, reducing costs, and running a successful project.

Omer Sagi

Senior data scientist on the data science team at Dell IT.
Mr. Sagi is a seasoned data scientist and currently a PhD candidate in the Department of Software and Information Systems Engineering at Ben-Gurion University.
At Dell, Omer has led several data science projects in fields including precision agriculture, online marketing, failure prediction, and text classification. His PhD research aims at developing algorithms that simplify ensemble models. His master's thesis, in the Department of Industrial Engineering at Ben-Gurion University, presented a novel approach for assessing the monetary damages of data loss incidents. Omer has also taught several courses, including Java programming and databases.

Presentations

Improving DevOps and QA efficiency using Machine Learning and NLP methods Session

DevOps and QA engineers devote a significant amount of time to investigating recurring issues. These issues are often represented by large configuration and log files, so investigating whether two issues are duplicates can be a very tedious task. This session presents a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues at Dell.
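As a minimal sketch of one common approach to this problem (TF-IDF vectors compared by cosine similarity, in stdlib Python on made-up issue text; the session's actual models are not described here):

```python
import math
from collections import Counter

# Hypothetical log excerpts from three reported issues.
issues = {
    "A": "connection timeout while writing to storage node",
    "B": "timeout on connection while writing to the storage node",
    "C": "user interface renders blank dashboard after login",
}

docs = {k: v.split() for k, v in issues.items()}
n = len(docs)

def idf(term):
    # Terms that appear in fewer documents get a higher weight.
    df = sum(term in words for words in docs.values())
    return math.log(n / df) + 1.0

def tfidf(words):
    tf = Counter(words)
    return {t: c * idf(t) for t, c in tf.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

vecs = {k: tfidf(w) for k, w in docs.items()}
sim_ab = cosine(vecs["A"], vecs["B"])
sim_ac = cosine(vecs["A"], vecs["C"])
print(sim_ab > sim_ac)  # near-duplicates (A, B) outscore unrelated issues (A, C)
```

Ranking candidate pairs by such a similarity score and surfacing the top matches is the simplest way to flag likely duplicates before an engineer reads the full logs.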

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by data scientists, particularly focusing on the Apache Spark ecosystem. Previously, he worked at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka. Neelesh holds a master’s degree in computer science with a focus on cloud computing from North Carolina State University and a bachelor’s degree in computer engineering from the University of Mumbai, India.

Presentations

A Compute Infrastructure for Data Scientists Session

This talk focuses on the compute infrastructure used by the large data science team at Stitch Fix. We will look at the architecture and the interacting tools within the ecosystem and discuss the challenges we overcame along the way.

Guillaume is the Machine Learning Services team leader.

After working on data lake features, he switched to applied machine learning in order to design a stack of services around this new field of application. His role is to extract high value from specific data science applications in order to generalize them and make them available to all.

Presentations

Continuous Delivery and Machine Learning Session

We will present how solving the problem of continuous deployment of machine learning models led us to build a full stack of automated machine learning. Automated machine learning allows us to rebuild models efficiently and keep them up to date with the fresh data brought in by our data convergence tool. It also offers model management, keeping the history of models and their performance.

Mathew Salvaris is a data scientist at Microsoft. Previously, Mathew was a data scientist for a small startup that provided analytics for fund managers. Before that, he was a postdoctoral researcher at UCL’s Institute of Cognitive Neuroscience, where he worked with Patrick Haggard in the area of volition and free will, devising models to decode human decisions in real time from the motor cortex using electroencephalography (EEG), and held a postdoctoral position at the University of Essex’s Brain Computer Interface group, where he worked on BCIs for computer mouse control. Mathew holds a PhD in brain-computer interfaces and an MSc in distributed artificial intelligence.

Presentations

Distributed Training of Deep Learning Models Session

In this talk we will present two platforms for running distributed deep learning training in the cloud. We will train a ResNet network on the ImageNet dataset using some of the most popular deep learning frameworks, then compare and contrast the performance improvements as we scale the number of nodes, as well as provide tips and details on the pitfalls of each framework and platform.
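The platforms and frameworks themselves are the talk's subject; the core mechanism of synchronous data-parallel training (each worker computes gradients on its data shard, the gradients are averaged, and every replica applies the same update) can be sketched on a toy linear model:

```python
# Toy dataset following y = 2x, split across two simulated workers.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0  # shared model parameter, replicated on every worker

def shard_gradient(w, shard):
    # Mean gradient of the squared error 0.5*(w*x - y)**2 over this shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

lr = 0.05
for step in range(100):
    grads = [shard_gradient(w, s) for s in shards]  # parallel in real systems
    g = sum(grads) / len(grads)                     # all-reduce: average gradients
    w -= lr * g                                     # identical update on every replica

print(round(w, 2))  # converges to 2.0
```

In the frameworks covered by the talk, the averaging step is the all-reduce over GPUs or nodes, and its communication cost is exactly what limits scaling as the node count grows.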

Mark Samson is a principal systems engineer at Cloudera, helping customers solve their big data problems using enterprise data hubs based on Hadoop. Mark has 17 years’ experience working with big data and information management software in technical sales, service delivery, and support roles.

Presentations

Running Data Analytic Workloads on the Cloud Tutorial

Cloud solutions offer an alternative to single, multipurpose clusters, with hyperscale storage decoupled from elastic, on-demand compute. Join us to discuss new paradigms for effectively running production-level pipelines with minimal operational overhead, removing barriers to data discovery, metadata sharing, and access control.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

Using a Global Data Fabric to Run a Mixed Cloud Deployment Session

Creating a business solution is a lot of work. Instead of building to run on a single cloud provider, it is far more cost-effective to leverage the cloud as infrastructure as a service (IaaS). A global data fabric is a requirement for running on all cloud providers simultaneously; this includes multi-master, active-active environments with full support for disaster management.

Jonathan Seidman is a software engineer on the partner engineering team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next generation data platform Tutorial

Using Internet of Things and customer 360 use cases as examples, we’ll explain how to architect a modern, real-time data platform leveraging recent advancements in open source software. We’ll show how components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL, together with modern storage engines, can enable new forms of data processing and analytics.

Stuart runs one of the most innovative AI companies in the world, pioneering a new class of AI, "behavioural AI." This isn't exactly surprising given his history. Having studied managerial sciences at university, specifically aspects of managerial accounting relating to behavioural economics and organizational behaviour, Stuart went on to start an early management information services company which, in the late 1980s, ran a private network for mail and file transfer between Toronto, New York, and Mexico City. After selling that company, Stuart founded one of Canada's first digital agencies, before the internet was a household thing. That company was sold in 1999 to WPP, and Stuart transitioned into finance, creating an innovative subprime lender that built a substantial automotive loan portfolio, sold shortly before the financial crisis of 2007. After that, Stuart once again got interested in technology, and specifically the early stages of AI. Today you will find him working with some of the world's largest companies, designing and implementing AI that changes the way they understand their big data.

Presentations

Blind Men & Elephants: What’s Missing from Your Big Data? Session

Analytics using big data tends to focus on what is easily available, which is, by and large, data about what has already happened. The implicit assumption is that past behaviour will predict future behaviour. In this presentation we will show how organizations already possess data they aren’t exploiting that, with the right tools, can be used to develop far more powerful predictive algorithms.

Tomer Shiran is cofounder and CEO of Dremio. Previously, Tomer was the VP of product at MapR, where he was responsible for product strategy, roadmap, and new feature development. As a member of the executive team, he helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He is the author of five US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, the Israel Institute of Technology.

Presentations

Data Science Across Data Sources with Apache Arrow Session

In the era of microservices and cloud apps, it is often impractical for organizations to physically consolidate all data into one system. Apache Arrow is an open source, columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real-time, simplifying and accelerating data access, without having to copy all data into one location.
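Arrow's actual format is a standardized, language-independent binary layout shared zero-copy between systems; this toy contrast between row and column orientation only illustrates why a common columnar representation helps:

```python
# Row-oriented: one record per element; a per-column scan touches every field.
rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 3.0}, {"id": 3, "price": 7.25}]

# Column-oriented (Arrow-style): one contiguous array per column, plus a schema.
columns = {
    "id": [r["id"] for r in rows],
    "price": [r["price"] for r in rows],
}

# An aggregation now reads a single contiguous array instead of hopping
# between records; this is the property that lets engines vectorize and
# lets different systems hand columns to each other without re-serializing.
total = sum(columns["price"])
print(total)  # 9.5 + 3.0 + 7.25
```

In real deployments the `pyarrow` library (and its counterparts in C++, Java, and other languages) provides these column buffers in a fixed memory layout, so an analytical engine and a data source can share them directly.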

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Understanding data at scale leveraging Spark and Deep Learning Frameworks Tutorial

We walk through approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, and text), leveraging Spark, its extended ecosystem of libraries, and deep learning frameworks. We use sample data and code for each to understand implementation nuances and subsequently highlight the bottlenecks, and solutions, for data and models at scale.

Zubin is the founder and CEO of QED Analytics, a consultancy specialising in mathematical modelling and machine learning. His work has involved a wide variety of industries, including cryptography, financial algorithms, genetics, and the design of strategy and training systems for world-leading sports teams and Olympic squads. Alongside his work at QED, Zubin is a Fellow in industrial and applied mathematics at Oxford University, and a lecturer and tutor in pure mathematics. He was recently voted the most outstanding tutor in Oxford across Mathematical, Physical, Life and Engineering Sciences.

Presentations

Keynote with Zubin Siganporia Keynote

Zubin Siganporia, founder and CEO of QED Analytics.

Kevin Sigliano is a professor of Digital Transformation and Entrepreneurship at IE Business School. He has over 15 years of corporate experience in management consulting firms (PWC Consulting, IBM BCS).

Kevin has also launched numerous start-ups and has always reserved a part of his time to do Pro Bono and educational work.

Over the last 10 years, Kevin has been involved in the international development of a smart cities and digital signage company, Admira.

Lastly, Mr. Sigliano is managing partner of Good Rebels (a digital transformation consulting firm), developing disruptive strategies for global players.

Presentations

Executive Briefing: The ROI of Data-Driven Digital Transformation Session

Financial and consumer ROI demands that business leaders understand the drivers and dynamics of digital transformation and big data. Disruptive value propositions and continuous innovation are critical if you wish to dramatically improve the way your company engages, creates value, and maximizes financial results.

Vartika Singh is a solutions consultant at Cloudera. Previously, Vartika was a data scientist applying machine-learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 10 years of experience designing and developing solutions and frameworks utilizing machine-learning techniques.

Presentations

Understanding data at scale leveraging Spark and Deep Learning Frameworks Tutorial

We walk through approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, and text), leveraging Spark, its extended ecosystem of libraries, and deep learning frameworks. We use sample data and code to understand implementation nuances, and subsequently highlight the bottlenecks and solutions for data and models at scale.

Konrad Sippel leads the MD&S Content Lab, which develops unique and scalable IP to drive growth for Deutsche Börse's services and products. Prior to his role as head of the Content Lab, Konrad led business development at STOXX Ltd.

Presentations

How Deutsche Börse Designed a World-Class Analytics Lab Session

Deutsche Börse has built out a data science team, the Content Lab, dedicated to advanced analytics efforts focused on driving new products in risk management and investment decision-making. In this session, Deutsche Börse will discuss the people, processes and technology that make up this initiative and the early successes driven out of the Content Lab.

I have worked on the Hive team at Qubole for the past two and a half years. I previously worked at Citrix and Cisco, and I graduated from NIT Allahabad in 2010.

Presentations

Autonomous ETL with Materialized Views Session

A framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.
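
The bookkeeping such a framework needs can be sketched in a few lines. The toy registry below (all names invented for illustration, not Qubole's actual code) shows the two properties the abstract emphasizes: strict correctness via invalidation on base-table change, and performance via lazy refresh on read.

```python
class ViewRegistry:
    """Toy materialized-view registry: invalidate eagerly, refresh lazily."""

    def __init__(self):
        self._views = {}   # view name -> (compute_fn, base table names)
        self._cache = {}   # view name -> materialized result
        self._stale = {}   # view name -> needs refresh?

    def create(self, name, base_tables, compute_fn):
        self._views[name] = (compute_fn, set(base_tables))
        self._stale[name] = True   # not materialized yet

    def notify_change(self, table):
        # Strict correctness: any view over a changed table is invalidated.
        for name, (_, bases) in self._views.items():
            if table in bases:
                self._stale[name] = True

    def read(self, name):
        compute_fn, _ = self._views[name]
        if self._stale[name]:
            self._cache[name] = compute_fn()   # refresh only when needed
            self._stale[name] = False
        return self._cache[name]
```

A production engine would additionally rewrite incoming queries to use matching views and suggest new views from the query workload; this sketch covers only the freshness bookkeeping.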

Prior to working at Captricity, Ramesh received his PhD in Electrical Engineering and Computer Science from MIT’s Computer Science and Artificial Intelligence lab (CSAIL), where his thesis focused on developing machine learning and computer vision technologies to enhance medical image analysis. Working to enable a cross-collaboration between researchers and doctors to understand large, complex medical image collections, Ramesh’s research has been applied to predict the effects of diseases such as Alzheimer’s on brain anatomy. Above all else, Ramesh is passionate about using technology for social good.

Presentations

How Captricity manages 10,000 tiny deep learning models in production Session

Most uses of deep learning involve very deep models trained with large datasets. At Captricity, we're also using deep learning with tiny datasets at scale, training thousands of models using tens to hundreds of examples each. These models are dynamically trained using our automatic deployment framework, and carefully chosen metrics further exploit error properties of the resulting models.

Bargava Subramanian is a machine learning engineer based in Bangalore, India. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. He mentors early-stage startups in their data science journey. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

Architectural Design for Interactive Visualization Session

Visualisation for data science requires an interactive visualisation setup that works at scale. In this talk, we explore the key architectural design considerations for such a system and illustrate, using real-life examples, the four key tradeoffs in this design space: rendering for data scale, computation for interaction speed, adaptivity to data complexity, and responsiveness to data velocity.

Deep Learning in the Browser: Explorable Explanations, Model Inference & Rapid Prototyping Session

We showcase three live demos of deep learning (DL) in the browser: building explorable explanations to aid insight, building model inference applications, and even rapidly prototyping and training ML models, all using emerging client-side JavaScript libraries for DL.

I am a machine learning practitioner with a strong academic background (a PhD in artificial intelligence) and an experienced lecturer with a record of delivering core CS courses to undergraduates at Ben-Gurion University. Combining both areas of expertise, I work today as a senior data scientist at Dell, with a broad view of both the business and the scientific aspects of the data science lifecycle. I am an independent learner and researcher, fluent with the common data science toolbox (Python, Pandas, SQL, Spark, etc.).

Presentations

Improving DevOps and QA efficiency using Machine Learning and NLP methods Session

DevOps and QA engineers devote a significant amount of time to investigating recurring issues. These issues are often represented by large configuration and log files, so determining whether two issues are duplicates can be a very tedious task. This session presents a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues at Dell.
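
A common baseline for this kind of duplicate detection is to represent each issue's log text as a TF-IDF vector and compare pairs by cosine similarity. The self-contained sketch below illustrates that baseline only; it is not Dell's actual pipeline, which the session describes.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many docs does each term appear?
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Two near-duplicate failure logs score far higher than an unrelated pair, which is enough to surface candidate duplicates for human review.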

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine learned models crash & burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
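
The annotation-pipeline pattern the tutorial builds on is easy to see in miniature. The toy version below is illustrative only (it is not the spaCy or Spark NLP API): each stage reads earlier annotations on a document and adds its own, so stages compose into a reusable pipeline.

```python
def tokenizer(doc):
    # First stage: annotate the document with its tokens.
    doc["tokens"] = doc["text"].lower().split()
    return doc

def stopword_filter(doc, stopwords=frozenset({"the", "a", "of"})):
    # Later stage: builds on the annotations added by earlier stages.
    doc["content_tokens"] = [t for t in doc["tokens"] if t not in stopwords]
    return doc

def run_pipeline(text, stages):
    """Thread a document dict through each annotation stage in order."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc
```

spaCy and Spark NLP follow the same shape at scale: a tokenizer, tagger, parser, and entity recognizer each enrich a shared document object, and Spark NLP distributes that pipeline across a cluster.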

Elena Terenzi has been advocating business analytics and big data solutions for the manufacturing sector in Western Europe. As part of this, Elena has helped big automotive customers implement telemetry analytics solutions with an IoT flavor in their enterprises.

Elena has been at Microsoft for 9 years, bringing business intelligence solutions to Microsoft enterprise customers. Before joining Microsoft, she started her career with data in 2003 as a database administrator and data analyst for an investment bank in Italy. Elena has a master's degree with a major in AI and NLP from the University of Illinois at Chicago.

Presentations

Detecting Small Scale Mines in Ghana Session

Computer vision is becoming one of the focus areas in artificial intelligence (AI), enabling computers to see and perceive like humans. In collaboration with Royal Holloway, University of London, we applied deep learning to locate small-scale mines in Ghana using satellite imagery, scaled training using Kubernetes, and investigated the mines' impact on surrounding populations and the environment.

Niranjan Thomas is General Manager, Platform and Technology Partnerships, Professional Information Business, at Dow Jones. He is responsible for the DNA data platform including product strategy, go to market, customer solutions and driving technology ecosystem partnerships.
Niranjan has over 16 years of experience in Technology leadership roles across software design and development, software project management, cybersecurity, risk and business management. He has spent significant time delivering solutions for the manufacturing, consulting, technology, telecommunications, media and financial services industries. Prior to joining Dow Jones he served as Head of Technology at AMP, Australia’s leading specialist wealth management and life insurance company.
Niranjan holds a Bachelor’s degree in Business Information Systems from RMIT University, Melbourne, Australia.

Presentations

Unlocking the hidden potential of bad news: Exploring the imaginative use of news-derived data as an essential ingredient for data scientists and developers to uncover and solve complex societal problems Session

What do a hurricane, the Zika virus, and modern slavery have in common? On one hand, they are serious global issues causing death, human suffering, and destruction. On the other, they are challenges the Strata community can seek to better understand, and whose negative consequences it can help minimize, all with news-derived data. We aim to inspire the data science community to think creatively about new datasets.

Lucio Tolentino is a computer and data scientist at Mashable, a digital tech-media company. Leveraging his background in computational epidemiology, he combs through data on how Mashable’s content is shared on social media for interesting and actionable insights. His research focuses on using network theory to understand news as a form of contagion, and applying machine learning to optimize content production.

Presentations

How Mashable uses data science to make stories go viral Session

When a particular story gets widespread attention, we often say it has gone viral, but the underlying process is complex, with a multitude of contributing factors. We present a series of experiments with random networks and viral content produced by Mashable that inform our understanding of viral news stories.
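
The kind of random-network experiment described here can be prototyped in a few lines. The sketch below simulates a simple share cascade on an Erdős-Rényi random network; it is a toy model with invented parameters, not Mashable's experimental setup.

```python
import random

def simulate_spread(n, avg_degree, share_prob, seeds=1, rng=None):
    """Toy contagion model: seed nodes see a story, and each exposed node
    shares it with each neighbor independently with probability share_prob.
    Returns how many nodes were ultimately reached."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    p_edge = avg_degree / (n - 1)
    # Build an Erdos-Renyi random network.
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p_edge:
                neighbors[i].append(j)
                neighbors[j].append(i)
    # Breadth-first cascade from the seed nodes.
    exposed = set(range(seeds))
    frontier = list(exposed)
    while frontier:
        nxt = []
        for node in frontier:
            for nb in neighbors[node]:
                if nb not in exposed and rng.random() < share_prob:
                    exposed.add(nb)
                    nxt.append(nb)
        frontier = nxt
    return len(exposed)
```

Sweeping share_prob in a model like this shows the threshold behavior that makes virality hard to predict: below a critical sharing rate cascades die out quickly, above it they can reach most of the network.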

Teresa Tung is a managing director at Accenture Technology Labs, where she is responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s industry platforms like Accenture Digital Connected Products and the Accenture Analytics Insights Platform. Teresa holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Becoming a data-driven enterprise—A maturity model Session

A data-driven enterprise maximizes the value of its data. But how do enterprises emerging from technology and organization silos get there? Drawing on our experience helping clients through this journey, we have created a data-driven enterprise maturity model that spans technology and business requirements. We will walk through use cases that bring the model to life.

Kate is a data scientist at the One Campaign and a Chapter Lead at DataKind UK. She first built a career in investment management at Värde Partners, where she worked in roles across risk management, trading, investment, data strategy and portfolio strategy. In late 2015, she left to pursue her passions for data science and social impact; today she consults with charities, NGOs and corporates to find stories and insights in data.

Presentations

Executive Briefing: Killer robots and how not to do data science Session

Not a day goes by without reading headlines about the fear of AI or how technology seems to be dividing us more than bringing us together. Here at DataKind UK we're passionate about how machine learning and artificial intelligence can be used for social good. We'll talk about what socially conscious AI looks like, and what we're doing to make it a reality.

Emre Velipasaoglu is a principal data scientist at Lightbend. A machined learning expert, Emre previously served as principal scientist and senior manager in Yahoo! Labs. He has authored 23 peer-reviewed publications and nine patents in search, machine learning, and data mining. Emre holds a PhD in electrical and computer engineering from Purdue University and completed postdoctoral training at Baylor College of Medicine.

Presentations

Machine Learned Model Quality Monitoring in Fast Data and Streaming Applications Session

Most machine learning algorithms are designed to work on stationary data, yet real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained, and without model quality monitoring, retraining decisions are suboptimal and costly. Here, we review monitoring methods and evaluate their applicability in modern fast data and streaming applications.
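
One of the simplest monitoring methods in this family is a sliding-window accuracy check against a reference level. The minimal sketch below illustrates the idea; the class name and thresholds are invented, and real streaming systems must also cope with delayed or missing labels, which the session covers in more depth.

```python
from collections import deque

class AccuracyMonitor:
    """Flag drift when sliding-window accuracy falls below a reference
    level by more than a margin (a toy sketch of model quality monitoring)."""

    def __init__(self, window=100, reference=0.9, margin=0.1):
        self.outcomes = deque(maxlen=window)   # recent correct/incorrect flags
        self.reference = reference             # accuracy at deployment time
        self.margin = margin                   # tolerated degradation

    def record(self, prediction, label):
        # Called whenever a ground-truth label arrives for a past prediction.
        self.outcomes.append(prediction == label)

    def drifting(self):
        if not self.outcomes:
            return False
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.reference - self.margin
```

A drifting() signal would then trigger retraining only when it is actually needed, rather than on a fixed schedule.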

Nanda Vijaydev has 10 years of experience in Data Management and Data Science. Currently at BlueData, she is leveraging Hadoop, Spark and Tachyon to build solutions for enterprise analytics use cases. Prior to joining BlueData, she has worked on Data Science and Big Data projects in multiple industries including healthcare and media. She was a principal solutions architect at Silicon Valley Data Science and Director of Solutions Engineering at Karmasphere prior to that. She has an in-depth understanding of the Data Analytics and Data Management space including data integration, ETL, warehousing, reporting and Hadoop.

Presentations

Deep Learning with TensorFlow and Spark using GPUs and Docker Containers Session

In the past, advanced machine learning techniques were only possible with a high-end proprietary stack. Today, you can use open source machine learning and deep learning algorithms with distributed computing technologies like Apache Spark and GPUs. This session will focus on how to deploy TensorFlow and Spark with the Nvidia CUDA stack on Docker containers in a multi-tenant environment.

Jivan is part of the Data + Design team at Fjord, where she focuses on building adaptable, human-centric data-driven experiences. She has a background in data science and has worked mainly in the healthcare and retail sectors, using data to drive design and business decision-making. Prior to Fjord, she was a part of the Advanced Analytics & AI team at Accenture’s multi-disciplinary research and incubation hub in Dublin.

Presentations

Designing Ethical Artificial Intelligence Session

There is no doubt that artificial intelligence systems are a powerful agent of change in our society. As this technology becomes increasingly prevalent, transforming our understanding of ourselves and our society, issues around ethics and regulation arise. In this session, we will explore how we can address fairness, accountability and the long-term effects on our society when designing with data.

Humanising Data — How to Find the Why Session

Data has opened up huge possibilities for analysing and customising services. We can now manage experiences to dynamically target audiences and respond immediately; however, context is often missing. We will go through practical steps to help you find the why behind the data patterns you are seeing and decide what level of personalised service to create.

Naghman Waheed leads the Data Platforms team at Monsanto and is responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order-to-cash, finance, and procurement. Throughout his 20+-year career at Monsanto, Naghman has held a variety of positions in the data space, ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master's degree in information management, both from Washington University.

Presentations

You call it Data Lake, we call it Data Historian Session

The last few years have seen a number of tools appear on the market that make it easy to implement a data lake. However, most of these tools lack the essential features that prevent a data lake from turning into a data swamp. At Monsanto, our data platform engineering team embarked on building a platform that can ingest, store, and provide access to datasets without compromising ease of use, governance, or security.

Dean Wampler, Ph.D., is the VP of Fast Data Engineering at Lightbend. He leads the development of Lightbend Fast Data Platform, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly Media. He is a contributor to several open source projects, a frequent Strata speaker, and the co-organizer of several conferences around the world and several user groups in Chicago. Dean lurks on Twitter as @deanwampler.

Presentations

Executive Briefing: What You Need to Know about Fast Data Session

Streaming data systems, so-called "fast data", promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just "faster" versions of big data: they force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. This talk tells you what you need to know to exploit fast data successfully.

Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams Tutorial

This hands-on tutorial builds streaming apps as microservices using Kafka with Akka Streams and Kafka Streams. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll be better informed when choosing tools for your own projects. We'll also contrast them with Spark Streaming and Flink, including when to choose those instead.

Hope Wang is a software engineer in Intuit’s Small Business Data and Analytics group. Hope is a self-taught, self-motivated, fully powered hacker and passionate about innovation. She holds a master’s degree in Biomedical Engineering from the University of Southern California.

Presentations

Machine Learning Platform Life-Cycle Management Session

There is increased demand for developing and scaling machine learning capabilities. A machine learning platform includes multiple phases that are iterative and overlap with one another. Hope explains how to manage various artifacts and their associations, and how to automate deployment, in order to support the lifecycle of a model and build a cohesive machine learning platform.

Rachel Warren is a programmer, data analyst, adventurer, and aspiring data scientist. After spending a semester helping teach algorithms and software engineering in Africa, Rachel has returned to the Bay Area, where she is looking for work as a data scientist or programmer. Previously, Rachel worked as an analyst for both Pandora and the Political Science department at Wesleyan. She is currently interested in pursuing a more technical, algorithmic approach to data science and is particularly passionate about dynamic learning algorithms (ML) and text analysis. Rachel holds a BA in computer science from Wesleyan University, where she completed two senior projects: an application that uses machine learning and text analysis for the Computer Science department, and a critical essay exploring the implications of machine learning for the analytic philosophy of language for the Philosophy department.

Presentations

Understanding Spark Tuning with Auto Tuning (or magical spells to stop your pager going off at 2am*) Session

Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. This talk will look at auto-tuning jobs using both historical and live job information, using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
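
Before any learning is involved, auto-tuning usually starts from rule-based heuristics over job metadata. The toy function below is an invented illustration of one such rule (the 128 MiB target mirrors a common Spark rule of thumb for partition sizing); an auto-tuner in the spirit of this talk would refine numbers like these from historical and live job information.

```python
import math

def suggest_shuffle_partitions(input_bytes,
                               target_partition_bytes=128 * 1024 ** 2,
                               min_partitions=8):
    """Size shuffle partitions so each handles roughly one target-sized
    chunk of input, with a floor to keep some parallelism on small jobs."""
    return max(min_partitions, math.ceil(input_bytes / target_partition_bytes))
```

For a 10 GiB input this suggests 80 partitions; a feedback loop could then adjust the target up or down based on observed task spill and duration.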

Jim Webber is Chief Scientist at Neo Technology working on next-generation solutions for massively scaling graph data. Prior to joining Neo Technology, Jim was a Professional Services Director with ThoughtWorks where he worked on large-scale computing systems in finance and telecoms. Jim has a Ph.D. in Computing Science from the Newcastle University, UK.

Presentations

Mixing Causal Consistency and Asynchronous Replication for Large Neo4j Clusters Session

How Neo4j mixes the strongly consistent Raft protocol with asynchronous log shipping to provide a strong consistency guarantee: causal consistency, which means you can always at least read your writes, even in very large, multi-data-center clusters.
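
The read-your-writes part of this guarantee can be sketched with a bookmark: the client remembers the id of its last write transaction, and a replica refuses to serve a read until it has applied at least that transaction. The code below is a toy model of that mechanism only; Neo4j's actual causal clustering protocol is considerably more involved.

```python
class Replica:
    """Toy asynchronously replicated follower that enforces read-your-writes."""

    def __init__(self):
        self.applied_txid = 0   # highest transaction id applied so far
        self.data = {}

    def apply(self, txid, key, value):
        # Async log shipping delivers committed writes in order, eventually.
        self.data[key] = value
        self.applied_txid = txid

    def read(self, key, bookmark):
        # Causal guarantee: never answer from a state older than the
        # client's last observed write (its bookmark).
        if self.applied_txid < bookmark:
            raise RuntimeError("replica behind bookmark; retry or redirect")
        return self.data.get(key)
```

A lagging replica raises instead of returning stale data, so the client can wait, retry, or be routed to a caught-up replica.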

Mike Wendt is a Manager of Applied Solutions Engineering at NVIDIA. His research work has focused on leveraging GPUs for big data analytics, data visualizations, and stream processing. Prior to joining NVIDIA, Mike led engineering work on big data technologies like Hadoop, Datastax Cassandra, Storm, Spark, and others. In addition, Mike has focused on developing new ways of visualizing data and the scalable architectures to support them. Mike holds a BS in computer engineering from the University of Maryland.

Presentations

GPU Accelerated Threat Detection with GOAI Session

Learn how we used GPU-accelerated open source technologies to improve our cyber defense platforms at NVIDIA. Leveraging software from the GPU Open Analytics Initiative (GOAI), we improved the performance and scale of our threat hunting activities. We will discuss how we accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration.

Tony Xing is a senior product manager on the AIDI (AI, Data and Infrastructure team) within Microsoft’s AI and Research Org. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service group. Tony was the PM for multiple data products: data ingestion, real time data analytics, data quality platform…

Presentations

Bring AI to BI: Project Kensho - the road to automated business incident monitoring and diagnostics in Microsoft Session

Introducing Project Kensho, the one-stop shop for business incident monitoring and automated insights within Microsoft, and our path to infusing AI into BI to serve different Microsoft teams. We share lessons learned, the technology evolution, the good and the bad, the architecture, and the algorithms, and show how engineering plus data science solved a common need that is applicable not only to Microsoft but to the wider industry.

Principal development manager on the AI Data and Infrastructure team at Microsoft.

Presentations

Bring AI to BI: Project Kensho - the road to automated business incident monitoring and diagnostics in Microsoft Session

Introducing Project Kensho, the one-stop shop for business incident monitoring and automated insights within Microsoft, and our path to infusing AI into BI to serve different Microsoft teams. We share lessons learned, the technology evolution, the good and the bad, the architecture, and the algorithms, and show how engineering plus data science solved a common need that is applicable not only to Microsoft but to the wider industry.

Fabian is the Chief Scientist of ShiftLeft Inc. Fabian has over 10 years of experience in the security domain, where he has worked as a security consultant and researcher, focusing on manual and automated vulnerability discovery. Throughout his work, he has identified previously unknown vulnerabilities in popular system components and applications such as the Microsoft Windows kernel, the Linux kernel, the Squid proxy server, and the VLC media player. He has presented his findings and techniques at both major industry conferences such as BlackHat USA, DefCon, First, and CCC, and renowned academic security conferences such as ACSAC, Security and Privacy, and CCS. He holds a master’s degree in computer engineering from Technical University Berlin, as well as a PhD in computer science from the University of Goettingen.

Presentations

Code Property Graph : A modern, queryable data storage for source code Session

While in earlier days code would generate data, with the code property graph (CPG) we now generate data for the code itself, so that we can understand it better.

Zhehan Wang is based in China and focuses on big data.

Presentations

Using Alluxio (formerly Tachyon) as a fault-tolerant, pluggable optimization component in JD.com's compute frameworks Session

JD.com uses Alluxio to support ad hoc and real-time stream computing. Among these workloads, JDPresto on Alluxio has led to a 10x performance improvement on average. We use Alluxio's HDFS-compatible URL scheme and deploy it as a pluggable optimization component.