Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Speakers

Hear from innovative CxOs, talented data practitioners, and senior engineers who are leading the data industry. More speakers will be announced; please check back for updates.

Alex Adam is a data scientist at Faculty. He’s particularly interested in generative neural networks and their applications both in natural language processing (text generation) and computer vision (video generation). He’s worked on many projects across sectors including retail, marketing, civil engineering, and private equity. His career highlights include having his work showcased at the Copenhagen Democracy Summit, presenting at O’Reilly conferences and BBC events, and being featured on BBC Newsround. Alex holds a PhD in theoretical physics from Imperial College London.

Presentations

Synthetic video generation: Why seeing should not always be believing Session

The advent of "fake news" has led us to doubt the truth of online media, and advances in machine learning give us an even greater reason to question what we are seeing. Despite the many beneficial applications of this technology, it's also potentially very dangerous. Alex Adam explains how synthetic videos are created and how they can be detected.


Peter Aiken is an acknowledged data management (DM) authority. He’s an associate professor of information systems at Virginia Commonwealth University (VCU), past president of the International Data Management Association (DAMA-I), and associate director of the MIT International Society of Chief Data Officers. He’s also the founder of Data Blueprint, a consulting firm that helps organizations leverage data for profit, improvement, competitive advantage, and operational efficiencies. A practicing data consultant, professor, author, and researcher, Peter has studied DM for more than 30 years and has assisted more than 150 organizations in 30 countries, including some of the world’s most important, gaining international recognition along the way. He is the author of 10 books and multiple publications, including his latest on data strategy. He’s a dynamic presence at events and hosts the longest running and most successful webinar series dedicated to DM (hosted by dataversity.net).

Presentations

Your data strategy: It should be concise, actionable, and understandable by business and IT Tutorial

Peter Aiken offers a more operational perspective on the use of data strategy, which is especially useful for organizations just getting started with data.

Alasdair Allan is a director at Babilim Light Industries and a scientist, author, hacker, maker, and journalist. An expert on the internet of things and sensor systems, he’s famous for hacking hotel radios, deploying mesh networked sensors through the Moscone Center during Google I/O, and for being behind one of the first big mobile privacy scandals when, back in 2011, he revealed that Apple’s iPhone was tracking user location constantly. He’s written eight books and writes regularly for Hackster.io, Hackaday, and other outlets. A former astronomer, he also built a peer-to-peer autonomous telescope network that detected what was, at the time, the most distant object ever discovered.

Presentations

Executive Briefing: The intelligent edge and the demise of big data? Session

Alasdair Allan explains why the current age, where privacy is no longer "a social norm," may not long survive the coming of the internet of things, as new smart embedded hardware may cause the demise of large-scale data harvesting. Smart devices will process data at the edge, allowing us to extract insights from the data without storing potentially privacy- and GDPR-infringing data.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He’s taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He’s widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Creating a data engineering culture Session

Jesse Anderson covers the most common reasons why data engineering teams fail and how to correct them, including ways to get your management to understand that data engineering is complex and time consuming. It is not data warehousing with new names, and management needs to understand that you can’t compare a data engineering team to, for example, the web development team.

Professional Kafka development 2-Day Training

Jesse Anderson offers an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.

Professional Kafka development (Day 2) Training Day 2

Jesse Anderson offers an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.

Eitan Anzenberg is the chief data scientist at Bill.com and has many years of experience as a scientist and researcher. His recent focus is on machine learning, deep learning, applied statistics, and engineering. Previously, Eitan was a postdoctoral scholar at Lawrence Berkeley National Lab and received his PhD in physics from Boston University and his BS in astrophysics from University of California, Santa Cruz. Eitan has two patents and 11 publications to date and has spoken about data at various conferences around the world.

Presentations

Explainable machine learning in fintech Session

Machine learning applications balance interpretability and performance. Linear models provide formulas to directly compare the influence of the input variables, while nonlinear algorithms produce more accurate models. Eitan Anzenberg explores a solution that utilizes what-if scenarios to calculate the marginal influence of features per prediction and compare with standardized methods such as LIME.

Amr Awadallah is the cofounder and CTO at Cloudera. Previously, Amr was an entrepreneur in residence at Accel Partners, served as vice president of product intelligence engineering at Yahoo, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr’s first startup, VivaSmart, was acquired by Yahoo in July 2000. Amr holds bachelor’s and master’s degrees in electrical engineering from Cairo University, Egypt, and a PhD in electrical engineering from Stanford University.

Presentations

BMW’s journey to the data-driven enterprise from the edge to AI Keynote

BMW Group is an extraordinary company. As a technology pioneer, it's an enterprise that recognizes the value that data offers to the business. The company's global platform draws data from over 150 different systems and delivers governed data to various divisions. Join Amr Awadallah and Tobias Bürger to discover some of BMW's most important use cases leveraging data from the edge to AI.

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

A Magic 8 Ball for optimal cost and resource allocation for the big data stack Session

Cost and resource provisioning are critical components of the big data stack. Shivnath Babu and Alkis Simitsis detail how to build a Magic 8 Ball for the big data stack—a decomposable time series model for optimal cost and resource allocation that offers enterprises a glimpse into their future needs and enables effective and cost-efficient project and operational planning.

Jason Bell specializes in high-volume streaming systems for large retail customers, using Kafka in a commercial context for the last five years. Jason was section editor for Java Developer’s Journal, has contributed to IBM developerWorks on autonomic computing, and is the author of Machine Learning: Hands On for Developers and Technical Professionals.

Presentations

Learning how to perform ETL data migrations with open source tool Embulk Session

The Embulk data migration tool offers a convenient way to load data into a variety of systems with basic configuration. Jason Bell offers an overview of the Embulk tool and outlines some common data migration scenarios that a data engineer could employ using the tool.

Juan Bengochea is an enterprise architect at Royal Caribbean Cruise Lines.

Presentations

Cerebro: Bringing together data scientists and BI users on a common analytics platform in the cloud DCS

Juan Bengochea offers an overview of Cerebro, a platform developed at Royal Caribbean Cruise Lines for analytics that serves the needs of both its data scientists and analysts from a common system. Cerebro is based on a number of open source projects, including Hadoop, Parquet, Spark, and Apache Arrow, as well as cloud services like Azure Data Lake Store, Azure Databricks, and MongoDB Atlas.

Francine Bennett is a data scientist and the CEO and cofounder of Mastodon C, a group of Agile big data specialists who offer the open source Hadoop-powered technology and the technical and analytical skills to help organizations to realize the potential of their data. Before founding Mastodon C, Francine spent a number of years working on big data analysis for search engines, helping them to turn lots of data into even more money. She enjoys good coffee, running, sleeping as much as possible, and exploring large datasets.

Presentations

Using data for evil V: The AI strikes back Session

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis (and possibly AI). Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

Daniel Bergqvist works on the Cloud Platform team at Google. In his more than 10 years of experience in the software industry, Daniel has held positions at companies such as Ericsson and Opera Software. Daniel holds a bachelor’s degree in computer science from Uppsala University. He lives in Stockholm and likes to spend his spare time freediving.

Presentations

Processing 10M samples a second to drive smart maintenance in complex IIoT systems Session

Geir Engdahl and Daniel Bergqvist explain how Cognite is developing IIoT smart maintenance systems that can process 10M samples a second from thousands of sensors. You'll explore an architecture designed for high performance, robust streaming sensor data ingest, and cost-effective storage of large volumes of time series data as well as best practices learned along the way.

Anirudha Beria is a member of the technical staff at Qubole, where he’s working on query optimizations and resource utilization in Apache Spark.

Presentations

Scalability-aware autoscaling of a Spark application Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs.

Pradeep Bhadani is a senior big data engineer at Hotels.com in London, where he builds and manages cloud infrastructure and core services like Apiary. Pradeep has worked in the big data space, building large-scale platforms, for the last seven years.

Presentations

Herding elephants: Seamless data access in multicluster clouds Session

Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms.

Aashish Bhateja is a senior program manager working on Microsoft Azure Machine Learning—building an exciting machine learning service that makes it easy for all data scientists and ML engineers to create and deploy robust, scalable, and highly available machine learning web services in the cloud.

Presentations

Time series forecasting with Azure Machine Learning Tutorial

Time series modeling and forecasting is fundamentally important to various practical domains; in the past few decades, machine learning model-based forecasting has become very popular in both private and public decision-making processes. Francesca Lazzeri walks you through using Azure Machine Learning to build and deploy your time series forecasting models.

Wojciech Biela is a cofounder of Starburst, where he’s responsible for product development. He has over 15 years’ experience building products and running engineering teams. Previously, Wojciech was the engineering manager at the Teradata Center for Hadoop, running the Presto engineering operations in Warsaw, Poland; built and ran the Polish engineering team for a subsidiary of Hadapt, a pioneer in the SQL-on-Hadoop space (acquired by Teradata in 2014); and built and led teams on multiyear projects from custom big ecommerce and SCM platforms to POS systems. Wojciech holds an MS in computer science from the Wroclaw University of Technology.

Presentations

The Presto Cost-Based Optimizer for interactive SQL on anything Session

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3, Azure ADLS, RDBMS, NoSQL, etc.). Wojciech Biela and Piotr Findeisen offer an overview of the Cost-Based Optimizer (CBO) for Presto, which brings a great performance boost. Join in to learn about CBO internals, the motivating use cases, and observed improvements.

Alun Biffin is a data scientist at Van Lanschot Kempen, where he applies machine learning to real-life business problems ranging from analyzing millions of web hits for online retailer Tails.com to predicting customer behavior at one of the Netherlands’ largest private banks. Previously, Alun was a Marie Curie fellow at the Paul Scherrer Institute, Switzerland, where he designed and conducted groundbreaking experiments on quantum magnets at cutting-edge facilities in Europe, the US, and Japan. He has presented his work at international workshops and conferences and published three papers as first author. His work has been cited over 100 times. He was also a recipient of the highly selective ASI Data Science Fellowship, London, in the summer of 2018. Alun holds a PhD in condensed matter physics from the University of Oxford.

Presentations

Using machine learning for stock picking Session

Alun Biffin and David Dogon explain how machine learning revolutionized the stock-picking process for portfolio managers at Kempen Capital Management by filtering the vast small-cap investment universe down to a handful of optimal stocks.

Peter Billen is a principal director at Accenture Belux, where he leads the assets and offerings around data-driven architectures and helps financial services clients adapt to the growing importance of data in today’s digital context. With a passion for innovation and 15 years of experience, Peter is a strong advocate for the power of metadata and believes that it will enable companies to drive automation to a new level, allowing them to combine both delivery and solution automation from the design phase, resulting in many efficiency and effectiveness benefits.

Presentations

Leveraging metadata for automating delivery and operations of advanced data platforms Session

Peter Billen explains how to use metadata to automate delivery and operations of a data platform. By injecting automation into the delivery processes, you shorten the time to market while improving the quality of the initial user experience. Typical examples include data profiling and prototyping, test automation, continuous delivery and deployment, and automated code creation.

Zoltán Borók-Nagy is a software engineer at Cloudera, where he works on Apache Impala and is a member of the project’s PMC. Previously, Zoltán worked for Ericsson, developing software analysis tools that have since become open source.

Presentations

Picking Parquet: Improved performance for selective queries in Impala, Hive, and Spark Session

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. Anna Szonyi and Zoltán Borók-Nagy share the technical details of the design and its implementation, practical tips to help data architects leverage these new capabilities in their schema design, and performance results for common workloads.

David Boyle is passionate about helping businesses build analytics-driven decision making so they can make quicker, smarter, and bolder decisions. Previously, he built global analytics and insight capabilities for a number of leading global entertainment businesses covering television (the BBC), book publishing (HarperCollins Publishers), and the music industry (EMI Music), helping to drive each organization’s decision making at all levels. He draws on experience building analytics for global retailers as well as political campaigns in the US and UK, in philanthropy, and in strategy consulting.

Presentations

Combining creativity and analytics Keynote

Companies that harness creativity and data in tandem have growth rates twice as high as companies that don’t. David Boyle shares lessons from his successes and failures in trying to do just that across presidential politics, with pop stars, and with power brands in the world of luxury goods. Join in to find out how analysts can work differently to build these partnerships and unlock this growth.

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

Alex Thomas and Claudiu Branzan lead a hands-on introduction to scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working code base that you can change and improve.

Mikio Braun is a principal engineer for search at Zalando, one of Europe’s biggest fashion platforms. He worked in research for a number of years before becoming interested in putting research results to good use in the industry. Mikio holds a PhD in machine learning.

Presentations

Fair, privacy-preserving, and secure ML

Mikio Braun explores techniques and concepts around fairness, privacy, and security when it comes to machine learning models.

Avner Braverman is cofounder and CEO of Binaris, building an application-optimized serverless platform. Avner’s full-stack experience ranges from hardware architecture through kernel design to JavaScript applications. He’s been working with distributed operating systems since his school days. Previously, he cofounded XIV, a distributed storage company, and Parallel Machines, a high-performance analytics company.

Presentations

Serverless for data and AI Session

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.

Nicolette Bullivant is the head of data engineering at Santander UK Technology. A technical manager with 18 years’ experience in the IT services industry, she previously led large-scale multilocation change projects comprising data provision, managed MI and data warehouses, ETL, system integration, and IT alignment.

Presentations

Designing the foundation for a data-driven future in financial services Findata

Nicolette Bullivant discusses how Santander has restructured its business around data. Join in to learn about the people, processes, and technology the organization brought together to make it a success and get practical ideas to help you start or progress along your journey with big data.

Tobias Bürger leads the Platform and Architecture Group within the Big Data, Machine Learning, and Artificial Intelligence Department at BMW Group, where he is responsible for the global big data platform that is the core technical pillar of the BMW data lake and is used across different divisions inside the BMW Group, spanning areas such as production, aftersales, and ConnectedDrive.

Presentations

BMW’s journey to the data-driven enterprise from the edge to AI Keynote

BMW Group is an extraordinary company. As a technology pioneer, it's an enterprise that recognizes the value that data offers to the business. The company's global platform draws data from over 150 different systems and delivers governed data to various divisions. Join Amr Awadallah and Tobias Bürger to discover some of BMW's most important use cases leveraging data from the edge to AI.

James Burke has been called “one of the most intriguing minds in the Western world” by the Washington Post. His audience is global. His influence in the field of the public understanding of science and technology is acknowledged in citations by such authoritative sources as the Smithsonian and Microsoft CEO Bill Gates. His work is on the curriculum of universities and schools across the United States. In 1965, James began work with BBC-TV on Tomorrow’s World and went on to become the BBC’s chief reporter on the Apollo Moon missions. For over 40 years, he has produced, directed, written, and presented award-winning television series on the BBC, PBS, Discovery Channel, and the Learning Channel. These include historical series, such as Connections (aired in 1979, it achieved the highest-ever documentary audience); The Day the Universe Changed; Connections2 and Connections3; a one-man science series, The Burke Special; a mini-series on the brain, The Neuron Suite; a series on the greenhouse effect, After the Warming; and a special for the National Art Gallery on Renaissance painting, Masters of Illusion.

James is a best-selling author whose publications include Tomorrow’s World, Tomorrow’s World II, Connections, The Day the Universe Changed, Chances, The Axemaker’s Gift (with Robert Ornstein), The Pinball Effect, The Knowledge Web, Circles, and American Connections. He has also written a series of introductions for the book Inventing Modern America and was a contributing author to Talking Back to the Machine and Leading for Innovation. His book Twin Tracks: The Unexpected Origins of the Modern World focuses on the surprising connections among the seemingly unconnected people, events, and discoveries that have shaped our world. James also wrote and hosted a best-selling CD-ROM, Connections: A Mind Game, and provided consulting and scripting for Disney Epcot. James is a frequent keynote speaker on the subject of technology and social change to audiences such as NASA, MIT, IBM, Microsoft, US government agencies, and the World Affairs Council. He has also advised the National Academy of Engineering, the Lucas Educational Foundation, and the SETI project. He was a regular columnist for six years at Scientific American and, most recently, contributed an essay on invention to the Britannica Online Encyclopedia. He’s currently a contributor to Time magazine. His most recent television work is a PBS retrospective of his work, ReConnections. James was educated at Oxford and holds honorary doctorates for his work in communicating science and technology. His latest project is an online interactive knowledge-mapping system (the knowledge web) to be used as a teaching aid, a management tool, and a predictor. It’s due to be online in 2020. His next book, The Culture of Scarcity, will also be published in 2020.

Presentations

Making the future Keynote

James Burke asks whether we can use big data and predictive analytics at the social level to take the guesswork out of prediction and make the future what we all want it to be. If so, this would give us the tools to handle what looks like being the greatest change to the way we live since we left the caves.

Julia Butter is an AI evangelist at Scout24, where she’s actively driving culture change within the company. Julia has a strong background in product development, including data products, strategy, and innovation. She’s an initiator of forward thinking and energizes through her creativity and enthusiasm.

Presentations

From data to data-driven to an AI-ready company: The culture change makes the difference DCS

Creating value out of your data is not about technology or engineers; it's about changing the culture in the company to make everyone aware of data and how to build on top of it. Julia Butter explains how Scout24 is running a successful culture change—60% of its employees are already using its central BI tool, and since 2018, it's been all about AI enablement.

Paris Buttfield-Addison is a cofounder of Secret Lab, a game development studio based in beautiful Hobart, Australia. Secret Lab builds games and game development tools, including the multi-award-winning ABC Play School iPad games, the BAFTA- and IGF-winning Night in the Woods, the Qantas airlines Joey Playbox games, and the Yarn Spinner narrative game framework. Previously, Paris was a mobile product manager for Meebo (acquired by Google). Paris particularly enjoys game design, statistics, blockchain, machine learning, and human-centered technology. He researches and writes technical books on mobile and game development (more than 20 so far) for O’Reilly; he recently finished writing Practical AI with Swift and is currently working on Head First Swift. He holds a degree in medieval history and a PhD in computing. Paris loves to bring machine learning into the world of the practical and useful. You can find him on Twitter as @parisba.

Presentations

Science-fictional user interfaces Session

Science fiction has been showcasing complex, AI-driven interfaces for decades. As TV, movies, and video games have become more capable of visualizing a possible future, the grandeur of these imagined science fictional interfaces has increased. Mars Geldard and Paris Buttfield-Addison investigate what we can learn from Hollywood UX. Is there a useful takeaway? Does sci-fi show the future of AI UX?

Jian Chang is a senior algorithm expert at the Alibaba Group, where he is working on cutting-edge applications of AI at the intersection of high-performance databases and the IoT, focusing on unleashing the value of spatiotemporal data. A data science expert and software system architect with expertise in machine learning and big data systems and deep domain knowledge on various vertical use cases (finance, telco, healthcare, etc.), Jian has led innovation projects and R&D activities to promote data science best practices within large organizations. He’s a frequent speaker at technology conferences, such as the O’Reilly Strata and AI Conferences, NVIDIA’s GPU Technology Conference, Hadoop Summit, DataWorks Summit, Amazon re:Invent, Global Big Data Conference, Global AI Conference, World IoT Expo, and Intel Partner Summit, and has published and presented research papers and posters at many top-tier conferences and journals, including ACM Computing Surveys, ACSAC, CEAS, EuroSec, FGCS, HiCoNS, HSCC, IEEE Systems Journal, MASHUPS, PST, SSS, TRUST, and WiVeC. He’s also served as a reviewer for many highly reputable international journals and conferences. Jian holds a PhD from the Department of Computer and Information Science (CIS) at the University of Pennsylvania, under Insup Lee.

Presentations

Building the data infrastructure for the internet of things at zettabyte scale Session

Jian Chang and Sanjian Chen share the architecture design and many detailed technology innovations of Alibaba TSDB, a state-of-the-art database for IoT data management, and discuss lessons learned from years of development and continuous improvement.

Jean-Luc Chatelain is a managing director for Accenture Digital and the CTO for Accenture Applied Intelligence, where he focuses on helping Accenture customers become information-powered enterprises by architecting state-of-the-art big data solutions. Previously, Jean-Luc was the executive vice president of strategy and technology for DataDirect Networks Inc. (DDN), the world’s largest privately held big data storage company, where he led the company’s R&D efforts and was responsible for corporate and technology strategy; a Hewlett-Packard fellow and vice president and CTO of information optimization responsible for leading HP’s information management and business analytics strategy; founder and CTO of Persist Technologies (acquired by HP), a leader in hyperscale grid storage and archiving solutions whose technology is the basis of the HP Information Archiving Platform (IAP); and CTO and senior vice president of strategic corporate development at Zantaz, a leading service provider of information archiving solutions for the financial industry, where he played an instrumental role in the development of the company’s services and raised millions of dollars in capital for international expansion. He has been a board member of DDN since 2007. Jean-Luc studied computer science and electrical engineering in France and business at Emory University’s Goizueta Executive Business School. He is bilingual in French and English and has also studied Russian and classical Greek.

Presentations

An Innovation Architecture industrializes AI from PoCs to production Session

Innovation is abundant as companies reimagine themselves as data-driven and AI-powered businesses. How do enterprises organize to move beyond numerous, often similar proofs of concept (PoCs) into production-quality products and services? Teresa Tung and Jean-Luc Chatelain explore Accenture’s Innovation Architecture, which manages PoCs and pilots through embedding into scalable, saleable solutions.

Executive Briefing: Using a domain knowledge graph to manage AI at scale Session

How do enterprises scale beyond one-off AI projects and make AI reusable? Teresa Tung and Jean-Luc Chatelain explain how domain knowledge graphs—the technology behind today's internet search—can bring the same democratized experience to enterprise AI. They then explore other applications of knowledge graphs in oil and gas, financial services, and enterprise IT.

Shailesh Chauhan leads business intelligence product management at Uber. Previously, he was a senior product manager at ThoughtSpot, where he was a member of the founding team. He helped build ThoughtSpot from 10 people to over 300 in five years and created the world’s first analytics search engine. He holds degrees from the University of California, Berkeley, the University of Illinois Urbana-Champaign, and IIT Guwahati.

Presentations

Integrated Business Intelligence Suite: How Uber built a platform to convert raw data into knowledge Session

Shailesh Chauhan explains how Uber built its business intelligence platform, detailing why the company took a platform approach rather than adding features in a piecemeal fashion.

Sanjian Chen is a senior algorithm expert at the Alibaba Group. He has deep knowledge of large-scale machine learning algorithms. Over his career, he’s developed cutting-edge data-driven modeling techniques and autonomous systems in both academic and industry settings and designed data-analytics solutions that drove numerous high-impact business decisions for multiple Fortune 500 companies across several industries, including retail, banking, automotive, and telecommunications. He’s currently working on building cutting-edge cloud-based AI engines for high-performance distributed database systems that support scalable data analytics in multiple business areas. Sanjian is a frequent invited speaker at top international conferences, including the Strata Data Conference (San Francisco, London), the IEEE Cyber-Physical Systems Week (Chicago), the IFAC conference on Analysis and Design of Hybrid Systems (Atlanta), and IEEE International Conference on Healthcare Informatics (Philadelphia, Dallas). He’s received two IEEE Best Paper Awards and published over 25 papers in top journals and conferences, including two published in the Proceedings of IEEE. He’s also served as an invited reviewer for numerous top international journals and conferences, including the IEEE Design & Test, IEEE Transactions on Computers, ACM Transactions on Cyber-Physical Systems, IEEE Transactions on Industrial Electronics, IEEE RTSS conferences, and the ACM HSCC conference. He holds a PhD in computer and information science from the University of Pennsylvania.

Presentations

Building the data infrastructure for the internet of things at zettabyte scale Session

Jian Chang and Sanjian Chen share the architecture design and many detailed technology innovations of Alibaba TSDB, a state-of-the-art database for IoT data management, and discuss lessons learned from years of development and continuous improvement.

Zhiling Chen is a machine learning engineer at GOJEK, one of the fastest growing startups in Asia. She and her colleagues work on scaling machine learning and driving impact throughout the organization. Her focus is on improving the speed at which data scientists iterate, the accuracy and performance of their models, the scalability of the systems they build, and the impact they deliver.

Presentations

Unlocking insights in AI by building a feature store Session

Features are key to driving impact with AI at all scales, allowing organizations to dramatically accelerate innovation and time to market. Willem Pienaar and Zhiling Chen explain how GOJEK, Indonesia's first billion-dollar startup, unlocked insights in AI by building a feature store called Feast, and the lessons they learned along the way.

Felix Cheung is an engineer at Uber and a PMC and committer for Apache Spark. Felix started his journey in the big data space about five years ago with the then state-of-the-art MapReduce. Since then, he’s (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop distro from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark and ended up contributing to the project. In addition to building stuff, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Presentations

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber Session

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.

Ira Cohen is a cofounder and chief data scientist at Anodot, where he’s responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Sequence-to-sequence modeling for time series Session

Sequence-to-sequence modeling (seq2seq) is now being used for applications based on time series data. Arun Kejariwal and Ira Cohen offer an overview of seq2seq and explore its early use cases. They then walk you through leveraging seq2seq modeling for these use cases, particularly with regard to real-time anomaly detection and forecasting.

Robert Cohen is a senior fellow at the Economic Strategy Institute, where he is directing a new study to examine the economic and business impacts of machine learning and AI on firms and the U.S. economy.

Presentations

Data-driven digital transformation and jobs: The new software hierarchy and ML Session

Robert Cohen discusses the skills that employers are seeking from employees in digital jobs, linked to the new software hierarchy driving digital transformation. Robert describes this software hierarchy as one that ranges from DevOps, CI/CD, and microservices to Kubernetes and Istio. This hierarchy is used to define the jobs that are central to data-driven digital transformation.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, he was a data scientist at TIBCO and a statistical software developer at AMD. Ian is a cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow 2-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow (Day 2) Training Day 2

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Damon Cortesi is a big data architect at Amazon Web Services, where he helps customers build and deploy data platforms. He spends most of his time exploring the depths of Apache codebases and making data and analytics more accessible. Previously, he was cofounder and CTO of a social analytics startup, where he built the initial data gathering and report generation pipelines and grew the team to over 100 people.

Presentations

Build your own data lake with AWS Glue and Amazon Athena (sponsored by Amazon Web Services) Session

Damon Cortesi demonstrates how to use AWS Glue and Amazon Athena to implement an end-to-end pipeline.

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Giselle Cory is the executive director of DataKind UK, an organization that helps the social sector use data science by bringing together social change organizations and pro bono data scientists. Previously, Giselle worked in the UK government (including in the Prime Minister’s Strategy Unit), and for national charities and think tanks (including the Resolution Foundation), using data to better inform public policy decisions. She believes that smart, responsible data collection and use can help the social sector tackle some of the UK’s biggest challenges. Giselle holds a BSc in maths and physics, an MSc in computational journalism, and diplomas in economics and manned spaceflight.

Presentations

Why is it so hard to do AI for good? Session

DataKind UK has been working in data for good since 2013, helping over 100 UK charities to do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. Duncan Ross and Giselle Cory explain how to identify the right data for good projects and how this can act as a framework for avoiding the same problems across industry.

Lidia Crespo is a technical business manager on the CDO team at Santander UK, where she leads the company’s big data governance activities. She and her team have been instrumental in the adoption of the technology platform, creating a sense of trust through their deep knowledge of the organization’s data. With her experience in complex and challenging international projection projects and a background in audits, IT, and data, Lidia brings a combination that is difficult to find.

Presentations

The vindication of big data: How Santander UK uses Hadoop to defend privacy Session

Big data is usually regarded as a menace to data privacy. But with data privacy principles and a customer-first mindset, it can be a game changer. Maurício Lins and Lidia Crespo explain how Santander UK applied this model to comply with GDPR, using graph technology, Hadoop, Spark, and Kudu to drive data obscuring, data portability, and machine learning exploration.

Samuel Cristobal is the science and technology director at Innaxis, where he manages the research agenda of the institute. Samuel has been a researcher at Innaxis for 10 years, over which he successfully executed more than a dozen data science projects in the field of aviation, ranging from mobility to safety, mostly as the technical or scientific coordinator. Previously, he was a research associate fellow at the University of Vienna, working on mathematical research with a focus on algebraic geometry, logic, and computer science. Samuel holds an MSc in advanced mathematics and applications from Universidad Autónoma of Madrid, a BSc (with honors) in mathematics from Universidad Complutense de Madrid, and a BEng (valedictorian) in telecommunication systems from Universidad Politécnica de Madrid.

Presentations

Machine learning in aviation is finally taking off DCS

DataBeacon is a multisided data and machine learning platform for the aviation industry. Samuel Cristóbal offers an overview of two of its applications: SmartRunway (a machine learning solution to runway optimization) and SafeOperations (operations safety predictive analytics).

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Findata Day welcome Tutorial

Welcome to the Findata Day tutorial.

Insurance and the gig economy Findata

Strata chair Alistair Croll looks at the changing role of risk and insurance in the on-demand economy. Drawing from examples in car insurance and beyond, Alistair discusses how the prediction of risk and the transient engagement between workers and tasks is changing the industry—and what insurers can do to prepare for it.

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Ifi Derekli is a senior solutions engineer at Cloudera, focusing on helping large enterprises solve big data problems using Hadoop technologies. Her subject-matter expertise is around security and governance, a crucial component of every successful production big data use case. Previously, Ifi was a presales technical consultant at Hewlett Packard Enterprise, where she provided technical expertise for Vertica and IDOL (currently part of Micro Focus). She holds a BS in electrical engineering and computer science from Yale University.

Presentations

Getting ready for GDPR and CCPA: Securing and governing hybrid, cloud, and on-premises big data deployments Tutorial

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Ifigeneia Derekli, Lars George, and Michael Ernest share hands-on best practices for meeting these challenges, with special attention paid to CCPA.

Apurva Desai leads the Dataproc, Composer, and CDAP products on the Data Analytics team at Google. Previously, Apurva led the mobile cloud team at Lenovo/Motorola, built and commercialized the Hadoop distribution at Pivotal Software, and spent six years at Yahoo leading various search and display advertising efforts as well as the Hadoop solutions team. He holds a master’s degree in EE from Simon Fraser University in Canada.

Presentations

Migrating Apache Oozie workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution.

Marcin Detyniecki is senior R&D officer at the Data Innovation Lab at AXA, a professor at the Polish Academy of Science (IBS PAN), and an associate researcher at the computer science laboratory LIP6 at the University Pierre and Marie Curie (UPMC). His research focuses on the emerging challenges of big data. He’s also worked on the usage of new media, with challenges ranging from multimedia information retrieval to image understanding. Several of his developed applications have been deployed in the market, and many were singled out in international competitions such as TrecVid, ImageClef, and MediaEval. Previously, he was a research scientist at the French National Center for Scientific Research (CNRS), a researcher at the University of California at Berkeley and at Carnegie Mellon University (CMU), and a visiting researcher at the University of Florence and at British Telecom Research labs. He’s a member of the research and academic council at UPMC, a member of the executive board of SMART Lab, an elected member of the LIP6 laboratory council, and a member of the editorial board of the International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (IJUFKS). He also funded and participated in the UPMC – Sorbonne Universités Computer Science Colloquium. He has over 90 publications in journals and conference proceedings, including six keynotes. Marcin studied mathematics, physics, and computer science at UPMC in Paris and holds a PhD in artificial intelligence from the same university.

Presentations

Is it possible to regulate machine learning? Dream versus R&D (sponsored by AXA) Session

Marcin Detyniecki offers an overview of the machine learning backend and its possible applications for the insurance business and other businesses based on the power of research merged with business.

Aaron has 15 years of practical experience managing BI and data engineering functions. He has worked in various industries, including transport, banking, retail, and aviation.

Whilst at easyJet, Aaron has been responsible for building up the technology data delivery function, implementing new ways of working inside a traditional IT function and helping make easyJet “the most data driven airline in the world.”

Presentations

How easyJet transformed to create a listening enterprise data hub in the cloud DCS

easyJet began its data transformation journey back in 2017. Aaronpal Dhanda shares the story of its two-year Data Transformation program to deliver an enterprise data lake and surrounding data fabric toolset to support advanced analytics, machine learning, customer 360 view, and corporate BI and reporting.

Wolff Dobson is a developer programs engineer at Google specializing in machine learning and games. Previously, he worked as a game developer, where his projects included writing AI for the NBA 2K series and helping design the Wii Motion Plus. Wolff holds a PhD in artificial intelligence from Northwestern University.

Presentations

TensorFlow for everyone Session

Wolff Dobson covers the latest in TensorFlow. Whether you're a beginner or are migrating from 1.x to 2.0, you'll learn the best ways to set up your model, feed your data to it, and distribute it for fast training. You'll also discover how TensorFlow has been recently upgraded to be more intuitive.

Harish Doddi is cofounder and CEO of Datatron. Previously, he held roles at Oracle; Twitter, where he worked on open source technologies, including Apache Cassandra and Apache Hadoop, and built Blobstore, Twitter’s photo storage platform; Snap, where he worked on the backend for Snapchat Stories; and Lyft, where he worked on the surge pricing model. Harish holds a master’s degree in computer science from Stanford, where he focused on systems and databases, and an undergraduate degree in computer science from the International Institute of Information Technology in Hyderabad.

Presentations

Model governance and model ops in the enterprise Session

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them.

David Dogon is a member of the data science team at Van Lanschot Kempen, where he primarily focuses on investments and asset management. David is driven by an interest in the insights and predictive power from data. A bit of an adventurer, he has performed research toward a PhD degree in mechanical engineering at TU Eindhoven in the Netherlands, holds a master’s degree in mechanical engineering from Columbia University in New York, and holds a bachelor’s degree in chemical engineering, which he completed in Cape Town, the same city where he was born.

Presentations

Fraud detection at a financial institution using unsupervised learning and text mining Session

David Dogon dives into a best practice use case for detecting fraud at a financial institution and details a dynamic and robust monitoring system that successfully detects unwanted client behavior. Join in to learn how machine learning models can provide a solution in cases where traditional systems fall short.

Using machine learning for stock picking Session

Alun Biffin and David Dogon explain how machine learning revolutionized the stock-picking process for portfolio managers at Kempen Capital Management by filtering the vast small-cap investment universe down to a handful of optimal stocks.

Mark Donsky is a director of product management at Okera, a software provider that offers discovery, access control, and governance at scale for today’s modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera, and he’s held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from Western University, Ontario, Canada.

Presentations

Executive Briefing: Big data in the era of heavy worldwide privacy regulations Session

The implications of new privacy regulations for data management and analytics, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance.

Getting ready for GDPR and CCPA: Securing and governing hybrid, cloud, and on-premises big data deployments Tutorial

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Ifigeneia Derekli, Lars George, and Michael Ernest share hands-on best practices for meeting these challenges, with special attention paid to CCPA.

Tal Doron is the director of technology innovation at GigaSpaces, where he bridges the gap between business and technology, architecting and strategizing digital transformation from ideas to success with strong business impact. He also manages presales activities, engaging with decision makers at all levels, from architects to C-level executives in strategic dialogue. Tal brings over a decade of technical experience in enterprise architecture, specializing in mission-critical applications with a focus on real-time analytics, distributed systems, identity management, fusion middleware, and innovation. Previously, Tal held positions at Dooblo, Enix, Experis BI, and Oracle.

Presentations

A deep learning approach to automatic call routing Session

Technological advancements are transforming customer experience, and businesses are beginning to benefit from deep learning innovations that automate call center routing to the most appropriate agent. Tal Doron explains how to run deep learning models with Intel BigDL and Spark frameworks colocated on an in-memory computing platform to enhance the customer experience without the need for GPUs.

How NLP is helping a European financial institution enhance customer experience Findata

Tal Doron explains how a leading IT service provider for financial firms leverages NLP to match cases on their CRM system to live service calls via case and subject, helping service agents provide first-call resolution quickly and efficiently to enhance customer experience and reduce time the agent spends on the line, lowering operational costs.

Ted Dunning is the chief technology officer at MapR, an HPE company. He’s also a board member for the Apache Software Foundation, a PMC member, and committer on a number of projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Report card on streaming microservices Session

As a community, we have been pushing streaming architectures, particularly microservices, for several years now. But what are the results in the field? Ted Dunning shares several (anonymized) case histories, describing the good, the bad, and the ugly. In particular, Ted covers how several teams who were new to big data fared by skipping MapReduce and jumping straight into streaming.

Maren Eckhoff is a principal data scientist at QuantumBlack, where she leads the analytics work on client projects, working across industries on predictive, explanatory, and optimization problems. Her role includes defining the analytical approach, developing the code base, building models, and communicating the results. Maren also leads the technical training program for QuantumBlack’s data science team and arranges bespoke trainings, seminars, and conference attendance. Previously, Maren worked in demand forecasting. She holds a PhD in probability theory from the University of Bath.

Presentations

Opening the black box: Explainable AI (XAI) Session

The success of machine learning algorithms in a wide range of domains has led to a desire to leverage their power in ever more areas. Maren Eckhoff discusses modern explainability techniques that increase the transparency of black box algorithms, drive adoption, and help manage ethical, legal, and business risks. Many of these methods can be applied to any model without limiting performance.

Geir Engdahl is CTO at Cognite, where he leads the R&D department in developing the Cognite industrial IoT data platform. Previously, Geir was founder and CEO/CTO at Snapsale, a machine learning classifieds startup (acquired by Schibsted), and a senior software engineer at Google in Canada, where he worked on machine learning for AdWords and AdSense, resulting in the Conversion Optimizer product. Geir holds an MSc in computational science from the University of Oslo. He won a silver medal from the International Olympiad in Informatics.

Presentations

Processing 10M samples a second to drive smart maintenance in complex IIoT systems Session

Geir Engdahl and Daniel Bergqvist explain how Cognite is developing IIoT smart maintenance systems that can process 10M samples a second from thousands of sensors. You'll explore an architecture designed for high performance, robust streaming sensor data ingest, and cost-effective storage of large volumes of time series data as well as best practices learned along the way.

Michael Ernest is a partner solution architect at Dataiku, supporting technical integration with cloud platforms. He previously led field-enablement programming at Cloudera, where he developed training for new and tenured hires in Hadoop operations, application architecture, and full stack security. He’s published four books on Java programming and Sun Solaris administration. Ernest lives in Berkeley, California.

Presentations

Getting ready for GDPR and CCPA: Securing and governing hybrid, cloud, and on-premises big data deployments Tutorial

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Ifigeneia Derekli, Lars George, and Michael Ernest share hands-on best practices for meeting these challenges, with special attention paid to CCPA.

Moty Fania is a principal engineer and the CTO of the Advanced Analytics Group at Intel, which delivers AI and big data solutions across Intel. Moty has rich experience in ML engineering, analytics, data warehousing, and decision-support solutions. He led the architecture work and development of various AI and big data initiatives such as IoT systems, predictive engines, online inference systems, and more.

Presentations

Building a sales AI platform: Key principles and lessons learned Session

Moty Fania shares his experience implementing a sales AI platform that handles processing of millions of website pages and sifts through millions of tweets per day. The platform is based on unique open source technologies and was designed for real-time data extraction and actuation.

Dr Alberto Favaro is a senior data scientist at Faculty. He has led data science projects in the energy, financial services, and retail sectors. His areas of expertise include distributed computing for big data, deep learning, and Bayesian statistics. He has extensive experience using TensorFlow, Dask, and MongoDB. He was previously a theoretical physicist and held research posts in the UK, at Imperial College London, and in Germany, at the Universities of Oldenburg and Cologne. His research was included among the ‘Top 10 breakthroughs of 2011’ by the magazine Physics World.

Presentations

AI for managers 2-Day Training

Nijma Khan and Alberto Favaro offer a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

AI for managers (Day 2) Training Day 2

Nijma Khan and Alberto Favaro offer a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Presentations

Insightful health: Amplifying intelligence in healthcare patient flow execution Session

Fabio Ferraretto and Claudia Regina Laselva explain how Hospital Albert Einstein and Accenture evolved patient flow experience and efficiency with the use of applied AI, statistics, and combinatorial math, allowing the hospital to anticipate E2E visibility within patient flow operations, from admission of emergency and elective demands to assignment and medical releases.

Piotr Findeisen is a software engineer and a founding member of the team at Starburst. He contributes to the Presto code base and is also active in the community. Piotr has been involved in the design and development of significant features like the Cost-Based Optimizer (still in development), spill to disk, correlated subqueries, and a plethora of smaller enhancements. Previously, Piotr worked at Teradata, where he was the top external Presto committer, and was a team leader at Syncron (a provider of cloud services for supply chain management), responsible for the product’s technical foundation and performance. Piotr holds an MS in computer science and a BSc in mathematics from the University of Warsaw.

Presentations

The Presto Cost-Based Optimizer for interactive SQL on anything Session

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3, Azure ADLS, RDBMS, NoSQL, etc.). Wojciech Biela and Piotr Findeisen offer an overview of the Cost-Based Optimizer (CBO) for Presto, which brings a great performance boost. Join in to learn about CBO internals, the motivating use cases, and observed improvements.

Daniel First is a data scientist at QuantumBlack. He’s worked with doctors and healthcare companies to design innovative, data-driven solutions to improve outcomes for patients by forecasting and preventing medical risks, and he’s also developed an approach to operationalize risk management for teams building machine learning models. He publishes on the social and political impact of artificial intelligence and speaks about the importance of making machine learning algorithms’ decisions interpretable to humans, most recently at the University of Oxford Mathematical Institute. He holds a BA in cognitive science and neuroscience from Yale University, an MPhil in philosophy from the University of Cambridge, where he specialized in the history of ethical thought, and an MS in data science from Columbia University.

Presentations

Operationalizing risk management for machine learning Findata

Daniel First details QuantumBlack’s innovative methodology for identifying and mitigating risk in the development and deployment of machine learning at scale.

Marcel Ruiz Forns is a software engineer on the analytics team at the Wikimedia Foundation. He believes it’s a privilege to be able to professionally contribute to Wikipedia and the free knowledge movement. He’s also worked on quite disparate things such as recommender systems, serious games, natural language processing, and…selling hand-painted T-shirts on the beach of Natal, Brazil.

Presentations

The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit Session

Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data over 90 days. Marcel Ruiz Forns explains how the Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both.

Michael J. Freedman is the cofounder and CTO of TimescaleDB and a full professor of computer science at Princeton University. His work broadly focuses on distributed and storage systems, networking, and security, and his publications have more than 12,000 citations. He developed CoralCDN (a decentralized content distribution network serving millions of daily users) and helped design Ethane (which formed the basis for OpenFlow and software-defined networking). Previously, he cofounded Illuminics Systems (acquired by Quova, now part of Neustar) and served as a technical advisor to Blockstack. Michael’s honors include a Presidential Early Career Award for Scientists and Engineers (given by President Obama), the SIGCOMM Test of Time Award, a Sloan Fellowship, an NSF CAREER award, the Office of Naval Research Young Investigator award, and support from the DARPA Computer Science Study Group. He earned his PhD at NYU and Stanford and his undergraduate and master’s degrees at MIT.

Presentations

Performant time series data management and analytics with PostgreSQL Session

Time series databases require ingesting high volumes of structured data, answering complex, performant queries for recent and historical time intervals, and performing specialized time-centric analysis and data management. Michael Freedman explains how to avoid these operational problems by reengineering Postgres to serve as a general data platform, including high-volume time series workloads.

Michael Freeman is a senior lecturer at the Information School at the University of Washington, where he teaches courses on data science, data visualization, and web development. With a background in public health, Michael works alongside research teams to design and build interactive data visualizations to explore and communicate complex relationships in large datasets. Previously, he was a data visualization specialist and research fellow at the Institute for Health Metrics and Evaluation, where he performed quantitative global health research and built a variety of interactive visualization systems to help researchers and the public explore global health trends. Michael is interested in applications of data visualization to social change. He holds a master’s degree in public health from the University of Washington. You can find samples from his projects on his website.

Presentations

Visually communicating statistical and machine learning methods Session

Statistical and machine learning techniques are only useful when they're understood by decision makers. While implementing these techniques is easier than ever, communicating about their assumptions and mechanics is not. Michael Freeman details a design process for crafting visual explanations of analytical techniques and communicating them to stakeholders.

Brandy Freitas is a principal data scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a research-physicist-turned-data-scientist based in Boston, Massachusetts. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryoelectron microscopy data. Brandy is a National Science Foundation Graduate Research Fellow and a James Mills Pierce Fellow. She holds an undergraduate degree in physics and chemistry from the Rochester Institute of Technology and did her graduate work in biophysics at Harvard University.

Presentations

Executive Briefing: Analytics for executives Session

Data science is an approachable field given the right framing. Often, though, practitioners and executives are describing opportunities using completely different languages. Brandy Freitas walks you through developing context and vocabulary around data science topics to help build a culture of data within your organization.

Ellen Friedman is a data technologist with a Ph.D. in biochemistry. She is a committer for the Apache Drill and Apache Mahout projects and co-author of books including AI & Analytics in Production, Machine Learning Logistics, Streaming Architecture, the Practical Machine Learning series, and Introduction to Apache Flink, all published by O’Reilly Media. Ellen has been a keynote speaker at JFokus in Stockholm, Big Data London, and NoSQL Matters Barcelona and an invited speaker at Strata Data conferences, Berlin Buzzwords, Nike Tech Talks, and the University of Sheffield Methods Institute.

Presentations

Executive Briefing: 5 things every executive should NOT know Session

A surprising fact of modern technology is that not knowing some things can make you better at what you do. This isn’t just lack of distraction or being too delicate to face reality. It’s about separation of concerns, with a techno flavor. Ellen Friedman outlines five things that best practice with emerging technologies and new architectures can give us ways to not know—and why that’s important.

Matt Fuller is cofounder at Starburst, the Presto company. Matt has held engineering roles in the data warehousing and analytics space for the past 10 years. Previously, he was director of engineering at Teradata, where he led engineering teams working on Presto and was part of the team that led the initiative to bring open source, in particular Presto, to Teradata’s products; architected and led development efforts for the next-generation distributed SQL engine at Hadapt (acquired by Teradata in 2014); and was an early engineer at Vertica (acquired by HP), where he worked on the query optimizer.

Presentations

Learning Presto: SQL on anything Tutorial

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.

Marina Rose Geldard (Mars) is a technologist from Down Under in Tasmania. Entering the world of technology relatively late as a mature-age student, she has found her place in the world: an industry where she can apply her lifelong love of mathematics and optimization. She compulsively volunteers at industry events, dabbles in research, and serves on the executive committee for her state’s branch of the Australian Computer Society (ACS) as well as the AUC. She’s writing Practical Artificial Intelligence with Swift for O’Reilly and working on machine learning projects to improve public safety through public CCTV cameras in her hometown of Hobart.

Presentations

Science-fictional user interfaces Session

Science fiction has been showcasing complex, AI-driven interfaces for decades. As TV, movies, and video games have become more capable of visualizing a possible future, the grandeur of these imagined science fictional interfaces has increased. Mars Geldard and Paris Buttfield-Addison investigate what we can learn from Hollywood UX. Is there a useful takeaway? Does sci-fi show the future of AI UX?

Lars George is the principal solutions architect at Okera. Lars has been involved with Hadoop and HBase since 2007 and became a full HBase committer in 2009. Previously, Lars was the EMEA chief architect at Cloudera, acting as a liaison between the Cloudera professional services team and customers as well as partners in and around Europe, building the next data-driven solutions, and a cofounding partner of OpenCore, a Hadoop and emerging data technologies advisory firm. He has spoken at many Hadoop User Group meetings as well as at conferences such as ApacheCon, FOSDEM, QCon, and Hadoop World and Hadoop Summit. He also started the Munich OpenHUG meetings. He’s the author of HBase: The Definitive Guide from O’Reilly.

Presentations

Getting ready for GDPR and CCPA: Securing and governing hybrid, cloud, and on-premises big data deployments Tutorial

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Ifigeneia Derekli, Lars George, and Michael Ernest share hands-on best practices for meeting these challenges, with special attention paid to CCPA.

Oliver Gindele is head of machine learning at Datatonic. Oliver is passionate about using computer models to solve real-world problems. Working with clients in retail, finance, and telecommunications, he applies deep learning techniques to tackle some of the most challenging use cases in these industries. He studied materials science at ETH Zurich and holds a PhD in computational physics from UCL.

Presentations

Deep learning for recommender systems Session

The success of deep learning has reached the realm of structured data in the past few years, where neural networks have been shown to improve the effectiveness and predictability of recommendation engines. Oliver Gindele offers a brief overview of such deep recommender systems and explains how they can be implemented in TensorFlow.

Emily Gorcenski is lead data scientist at ThoughtWorks. Emily has over 10 years of experience in scientific computing and engineering research and development. Her background is in mathematical analysis, with a focus on probability theory and numerical analysis. She’s currently working in Python development, though she also has experience with C#/.NET, Unity3D, SQL, and MATLAB as well as statistics and experimental design. Previously, she was principal investigator in a number of clinical research projects.

Presentations

Continuous intelligence: Keeping your AI application in production Session

Machine learning can be challenging to deploy and maintain. Any delays in moving models from research to production mean leaving your data scientists' best work on the table. Arif Wider and Emily Gorcenski explore continuous delivery (CD) for AI/ML along with case studies for applying CD principles to data science workflows.

Ever since the completion of her studies, Caroline Goulard has nurtured a passion for how information can be expressed, shared, and understood. In 2010, sensing that the rich data era will transform the way we work, learn, and communicate, she cofounded Dataveyes, a studio specialized in human-data interactions, where she translates data into interactive experiences in order to reveal new insightful stories, accompany new uses, and understand our environment shaped by data and algorithms.

Presentations

When you don’t really know what to do with this huge pile of strategic data DCS

Caroline Goulard demonstrates how to leverage data to rethink the way an industry builds its offering, improve customer experience, and feed its prospective reflection. Caroline focuses on the approach, the methodology, and the results achieved for a leading public transport operator.

Sonal Goyal is the founder and CEO at Nube Technologies, a startup focused on big data preparation and analytics. Nube Technologies builds business applications for better decision making through better data. Sonal and the team at Nube help customers build better, more effective models by ensuring that their underlying master data is accurate. The company’s fuzzy matching product, Reifier, helps companies get a holistic view of enterprise data. By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, enhance security and risk management, and improve the consolidation and reporting of business data.

Presentations

Mastering data with Spark and machine learning Session

Enterprise data on customers, vendors, and products is often siloed and represented differently in diverse systems, hurting analytics, compliance, regulatory reporting, and 360 views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. Sonal Goyal offers an overview of a modern master data application using Spark, Cassandra, ML, and Elastic.

Trevor Grant is a computer nerd at IBM, an Apache Software Foundation Member, and is involved in multiple projects such as Mahout, Streams, and SDAP-incubating, just to name a few. He speaks about computer stuff internationally. He’s taken numerous classes in stand-up and improv comedy to make his talks more pleasant for you—the listener. He holds an MS in applied math and an MBA from Illinois State University.

Presentations

Cross-cloud model training and serving with Kubeflow Tutorial

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud.

Jay Green is a final-year student in computer science at King’s College London. She joined the big data platform team at Hotels.com for her industrial placement year, where she’s worked with Apache Hive, modularization techniques for SQL, and mutation testing tools.

Presentations

Mutant tests too: The SQL Session

Elliot West and Jay Green share approaches for applying software engineering best practices to SQL-based data applications to improve maintainability and data quality. Using open source tools, Elliot and Jay show how to build effective test suites for Apache Hive code bases and offer an overview of Mutant Swarm, a tool to identify weaknesses in tests and to measure SQL code coverage.

Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Disrupting data discovery Session

Mark Grover discusses how Lyft has reduced the time it takes to discover data by 10 times by building its own data portal, Amundsen. Mark gives a demo of Amundsen, leads a deep dive into its architecture, and discusses how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. Mark closes with a future roadmap, unsolved problems, and collaboration model.

The Lyft data platform: Now and in the future Session

Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.

Luke (Qing) Han is a cofounder and CEO of Kyligence, cocreator and PMC chair of Apache Kylin, the leading open source OLAP for big data, and a Microsoft regional director and MVP. Luke has 10+ years’ experience in data warehouses, business intelligence, and big data. Previously, he was big data product lead at eBay and chief consultant of Actuate China.

Presentations

Augmented OLAP for big data from on-premises to multicloud (sponsored by Kyligence) Session

Augmenting data management and analytics platforms with artificial intelligence and machine learning is game changing for analysts, engineers, and other users. It enables companies to optimize their storage, speed, and spending. Luke Han explains how the Kyligence platform is evolving to the next level, with augmented capabilities such as intelligent modeling, smart pushdowns, and more.

Sven Hansen is a solution architect at AWS focused on helping customers in the financial services domain solve technical and business challenges. He has extensive work experience supporting large enterprises across both sub-Saharan Africa and Europe in areas relating to security, networking and analytics technologies. He holds an MBA from the Heriot-Watt Business School in Edinburgh. He’s also an avid musician and would like to spend more time playing guitar and bass.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Nischal HP is the vice president of engineering at Berlin-based AI startup omnius, which builds AI products for the insurance industry. Nischal is also a mentor for data science on Springboard. Previously, he was a cofounder and data scientist at Unnati Data Labs, where he helped build end-to-end data science systems in the fields of fintech, marketing analytics, event management, and medicine. During his tenure at former companies like Redmart and SAP, he was involved in architecting and building software for ecommerce systems in catalog management, recommendation engines, sentiment analyzers, data crawling frameworks, intention mining systems, and gamification of technical indicators for algorithmic trading platforms. Nischal has conducted workshops in the field of deep learning and has spoken at a number of data science conferences like the Strata Data Conference in San Jose 2017, PyData London 2016, PyCon Czech Republic 2015, The Fifth Elephant India (2015 and 2016), and Anthill Bangalore 2016. He’s a strong believer in open source and loves to architect big, fast, and reliable AI systems. In his free time, he enjoys traveling with his significant other, music, and grokking the web.

Presentations

Deep learning for fonts Session

Deep learning has enabled massive breakthroughs in offbeat tracks and has enabled better understanding of how an artist paints, how an artist composes music, and so on. Nischal Harohalli Padmanabha and Raghotham Sripadraj discuss their project Deep Learning for Humans and their plans to build a font classifier.

Christian Hidber is a software engineer at bSquare, where he applies machine learning to industrial hydraulics simulation, part of a product with 7,000 installations in 42 countries. He holds a PhD in computer algebra from ETH Zurich, which he followed with a postdoc at UC Berkeley, where he researched online data mining algorithms.

Presentations

Reinforcement learning: A gentle introduction and an industrial application Session

Reinforcement learning (RL) autonomously learns complex processes such as walking, beating the world champion in Go, or flying a helicopter. No big datasets with the “right” answers are needed: the algorithms learn by experimenting. Christian Hidber shows how and why RL works and demonstrates how to apply it to an industrial hydraulics application with 7,000 clients in 42 countries.

Ana Hocevar is a data scientist in residence at the Data Incubator, where she combines her love for coding and teaching. Ana has more than a decade of experience in physics and neuroscience research and over five years of teaching experience. Previously, she was a postdoctoral fellow at the Rockefeller University, where she worked on developing and implementing an underwater touch screen for dolphins. She holds a PhD in physics.

Presentations

Machine learning from scratch in TensorFlow 2-Day Training

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Ana Hocevar offers an intro to TensorFlow's capabilities in Python, taking you from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow with several hands-on applications.

Machine learning from scratch in TensorFlow (Day 2) Training Day 2

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Ana Hocevar offers an intro to TensorFlow's capabilities in Python, taking you from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow with several hands-on applications.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they could never before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

Protecting sensitive data in huge datasets: Cloud tools you can use Session

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa explores how to handle massive public datasets, taking you from theory to real life as he showcases newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm.

Mick Hollison leads Cloudera’s worldwide marketing efforts, including advertising, brand, communications, demand, partner, solutions, and web. Mick has had a successful 25-year career in enterprise and cloud software. Previously, he was CMO at sales acceleration and machine learning company InsideSales.com, helping the company pioneer a shift to data-driven marketing and sales that has served as a model for organizations around the globe; served as global vice president of marketing and strategy at Citrix, where he led the company’s push into the high-growth desktop virtualization market; managed executive marketing at Microsoft; and held numerous leadership positions at IBM Software. Mick is an advisory board member for InsideSales and a contributing author on Inc.com. He’s also an accomplished public speaker who has shared his insightful messages about the business impact of technology with audiences around the world. Mick holds a BS in management from the Georgia Institute of Technology.

Presentations

Executive Briefing: From the edge to AI—Taking control of your data for fun and profit Session

Managing your data securely is difficult, as is choosing the right machine learning tools and managing models and applications in compliance with regulation and law. Mick Hollison covers the risks and the issues that matter most and explains how to address them with an enterprise data cloud and by embracing your data center and the public cloud in combination.

The enterprise data cloud Keynote

The last decade has seen incredible changes in our technology. The advent of big data and powerful new analytic techniques, including machine learning and AI, means that we understand the world in ways that were simply impossible before. The simultaneous explosion of public cloud services has fundamentally changed our expectations of technology: it should be fast, simple, and flexible to use.

Matthew Honnibal is the creator and lead developer of spaCy, one of the most popular libraries for natural language processing. He’s been publishing research on NLP since 2005, with a focus on syntactic parsing and other structured prediction problems. He left academia to start working on spaCy in 2014.

Presentations

Agile NLP workflows with spaCy and Prodigy Session

Matthew Honnibal shares "one weird trick" that can give your NLP project a better chance of success: avoid a waterfall methodology where data definition, corpus construction, modeling, and deployment are performed as separate phases of work.

Christopher Hooi is the deputy director of communications and sensors at the Land Transport Authority of Singapore. He’s passionate about harnessing big data innovations to address complex land transport issues. Since 2010, he has embarked on a long-term digital strategy with the main aim of achieving smart urban mobility in a fast-changing digital world. Central to this strategy is building and sustaining a land transport digital ecosystem through an extensive network of sensor feeds, analytical processes, and commuter outreach channels, synergistically put together to deliver a people-centered land transport system.

Presentations

Early incident detection using fusion analytics of commuter-centric data sources Session

Christopher Hooi offers an overview of the Fusion Analytics for Public Transport Event Response (FASTER) system, a real-time advanced analytics solution for early warning of potential train incidents. FASTER uses engineering and commuter-centric IoT data sources to activate contingency plans at the earliest possible time and reduce impact to commuters.

Joel Horwitz is senior vice president of marketing at WANdisco. Joel is an experienced high-tech marketing professional with a diverse background in research and development, product strategy, and corporate development. Previously, he was the global vice president of strategic partnerships and offerings for IBM’s Digital Business Group and led the formation of IBM’s data science and machine learning product portfolio through strategic marketing and partner ecosystem development. He also delivered accretive growth at various data and analytics startups, including AVG Technologies, Datameer, Alpine Data Labs, and H2O.ai, through the introduction of platform partnerships, self-service offerings, and digital marketing. He has launched 18 new products generating $1+ billion in revenue with 50+ partnerships and acquisitions and 1M+ developers, bridging corporate initiatives with startup innovation to capture market opportunity.

Presentations

How a LiveData strategy breaks down barriers to overcome data gravity (sponsored by WANdisco) Session

Joel Horwitz shares best practices WANdisco clients have taken to evolve their data architecture to become a LiveData company.

Shant Hovsepian is a cofounder and CTO of Arcadia Data, where he’s responsible for the company’s long-term innovation and technical direction. Previously, Shant was an early member of the engineering team at Teradata, which he joined through the acquisition of Aster Data, and he interned at Google, where he worked on optimizing the AdWords database. His experience includes everything from Linux kernel programming and database optimization to visualization. He started his first lemonade stand at the age of four and ran a small IT consulting business in high school. Shant studied computer science at UCLA, where he had publications in top-tier computer systems conferences.

Presentations

Intelligent design patterns for cloud-based analytics and BI (sponsored by Arcadia Data) Session

With cloud object storage, you may expect business intelligence (BI) applications to benefit from the scale of data and real-time analytics, but traditional BI in the cloud faces not-so-obvious challenges. Shant Hovsepian discusses considerations for service-oriented cloud design and shows how native cloud BI provides analytic depth, low cost, and high performance.

Alexandre Hubert is one of Dataiku’s top data scientists, but he began his career in a very different domain: working as a trader. He soon realized that with the huge amount of data out there, it was possible (and fun) to resolve problems using real-life data. Since becoming a data scientist, Alexandre has worked on a range of use cases, from creating models that predict fraud to building specific recommendation systems. He especially loves using deep learning with text or sports data. Even when he’s having fun with friends, Alexandre sees numbers and patterns everywhere, bringing him quickly back to his laptop to try out new ideas.

Presentations

Improving infrastructure efficiency with unsupervised algorithms Session

GRDF helps bring natural gas to nearly 11 million customers every day. Alexandre Hubert explains how, in partnership with GRDF, Dataiku worked to optimize the manual process of qualifying addresses to visit and ultimately save GRDF time and money. This solution was the culmination of a yearlong adventure in the land of maintenance experts, legacy IT systems, and Agile development.

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He’s a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he’s an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Solving data cleaning and unification using human-guided machine learning Session

Last year, Ihab Ilyas covered two primary challenges in applying machine learning to data curation: entity consolidation and using probabilistic inference to suggest data repair for identified errors and anomalies. This year, he explores these limitations in greater detail and explains why data unification projects quickly require human-guided machine learning and a probabilistic model.

Rashed Iqbal is the cofounder of the field of narrative economics and is chief technology officer for a government investment fund in the UAE. He believes narrative modeling will revolutionize the process of human communication. Previously, he held technology and management roles at Teledyne Technologies, Western Digital Corporation, Synopsys, and other companies. Rashed teaches graduate courses in data science in the Economics Department at UCLA and also teaches at UC Irvine and UCLA Extension. His areas of interest include text analytics, natural language understanding, and Lean and Agile development. Rashed has led multiple entrepreneurial ventures in data science. He holds a PhD in systems engineering from the University of Sheffield, UK.

Presentations

Modeling the Tesla narrative Findata

Despite fierce challenges, Tesla has upended not only the automotive and technology sectors but also our perception of disruption itself. Tesla and its enigmatic CEO, Elon Musk, have consistently used narratives to support their brand and market valuation. Rashed Iqbal shares a case study in the application of narrative modeling to news and social media content about Tesla since its inception.

Amir Issaei is a data science consultant at Databricks, where he educates customers on how to leverage the company’s Unified Analytics Platform in machine learning (ML) projects. He also helps customers implement ML solutions and use advanced analytics to solve business problems. Previously, he worked in the Operations Research Department at American Airlines, where he supported the Customer Planning, Airport, and Customer Analytics Groups. He holds an MS in mathematics from the University of Waterloo and a BE in physics from the University of British Columbia.

Presentations

Large-scale ML with MLflow, deep learning, and Apache Spark 2-Day Training

Join Amir Issaei to explore neural network fundamentals and learn how to build distributed Keras/TensorFlow models on top of Spark DataFrames. You'll use Keras, TensorFlow, Deep Learning Pipelines, and Horovod to build and tune models and MLflow to track experiments and manage the machine learning lifecycle. This course is taught entirely in Python.

Large-scale ML with MLflow, deep learning, and Apache Spark (Day 2) Training Day 2

Join Amir Issaei to explore neural network fundamentals and learn how to build distributed Keras/TensorFlow models on top of Spark DataFrames. You'll use Keras, TensorFlow, Deep Learning Pipelines, and Horovod to build and tune models and MLflow to track experiments and manage the machine learning lifecycle. This course is taught entirely in Python.

Maryam Jahanshahi is a research scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD from the Icahn School of Medicine at Mount Sinai, where she studied molecular regulators of organ-size control. Maryam’s long-term research goal is to reduce bias in decision making by using a combination of computational linguistics, machine learning, and behavioral economics methods.

Presentations

The evolution of data science skill sets: An analysis using exponential family embeddings Session

Maryam Jahanshahi explores exponential family embeddings: methods that extend the idea behind word embeddings to other data types. You'll learn how TapRecruit used dynamic embeddings to understand how data science skill sets have transformed over the last three years, using its large corpus of job descriptions, and more generally, how these models can enrich analysis of specialized datasets.

Alejandro (Alex) Jaimes is senior vice president of AI and data science at Dataminr. His work focuses on mixing qualitative and quantitative methods to gain insights on user behavior for product innovation. Alex is a scientist and innovator with 15+ years of international experience in research leading to product impact at companies including Yahoo, KAIST, Telefónica, IDIAP-EPFL, Fuji Xerox, IBM, Siemens, and AT&T Bell Labs. Previously, Alex was head of R&D at DigitalOcean, CTO at AiCure, and director of research and video products at Yahoo, where he managed teams of scientists and engineers in New York City, Sunnyvale, Bangalore, and Barcelona. He was also a visiting professor at KAIST. He has published widely in top-tier conferences (KDD, WWW, RecSys, CVPR, ACM Multimedia, etc.) and is a frequent speaker at international academic and industry events. He holds a PhD from Columbia University.

Presentations

AI for good at scale in real time: Challenges in machine learning and deep learning Session

When emergency events occur, social signals and sensor data are generated. Alex Jaimes explains how to apply machine learning and deep learning to process large amounts of heterogeneous data from various sources in real time, with a particular focus on how such information can be used for emergencies and in critical events for first responders and for other social good use cases.

Dave Josephsen runs the telemetry engineering team at Sparkpost. He thinks you’re pretty great.

Presentations

Schema on read and the new logging way Session

David Josephsen tells the story of how Sparkpost's reliability engineering team abandoned ELK for a DIY schema-on-read logging infrastructure. Join in to learn the architectural details, trials, and tribulations from the company's Internal Event Hose data ingestion pipeline project, which uses Fluentd, Kinesis, Parquet, and AWS Athena to make logging sane.

Yiannis Kanellopoulos has spent the better part of two decades analyzing and evaluating software systems in order to help organizations address any potential risks and flaws related to them. (In his experience, these risks or flaws are always due to human involvement.) With Code4Thought, Yiannis is turning his expertise into democratizing technology by rendering algorithms transparent and helping organizations become accountable. Targeted outcomes of his work include building trust between the organization utilizing the algorithms and those affected by its output and rendering the algorithms more persuasive, since their reasoning will be easier to explain. He’s also a founding member of Orange Grove Patras, a business incubator sponsored by the Dutch Embassy in Greece to promote entrepreneurship and counter youth unemployment. Yiannis holds a PhD in computer science from the University of Manchester.

Presentations

On the accountability of black boxes: How we can control what we can’t exactly measure Findata

Black box algorithmic systems make decisions that have a great impact on our lives. Thus, the need for their accountability and transparency is growing. Code4Thought created an evaluation model reflecting the state of practice in several organizations. Yiannis Kanellopoulos explores this model and shares lessons learned from its application at a financial corporation.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Autoscaling Spark on Kubernetes Session

In the Kubernetes world, where declarative resources are a first-class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice, we can finally look at constructing a hybrid system of running Spark in a distributed cloud native way. Join respective experts Kris Nova and Holden Karau for a fun adventure.

Cross-cloud model training and serving with Kubeflow Tutorial

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud.

Improving Spark downscaling; Or, Not throwing away all of our work Session

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers, covering everything from how to schedule jobs to the location of blocks and their impact.

Rohit Karlupia is a technical director at Qubole, where his primary focus is making big data as a service debuggable, scalable, and performant. His current work includes SparkLens (open source Spark profiler) and GC/CPU-aware task scheduling for Spark and Qubole Chunked Hadoop File System. Rohit’s primary research interests are performance and scalability of cloud applications. Over his career, he’s mainly been writing high-performance server applications and has deep expertise in messaging, API gateways, and mobile applications. He holds a bachelor of technology in computer science and engineering from IIT Delhi.

Presentations

Scalability-aware autoscaling of a Spark application Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs.

Nate Keating is a product manager at Google working on the Cloud AI Platform, which includes Cloud Training and Prediction, AI Hub, Kubeflow, Kubeflow Pipelines, and more to come. Previously, he was manager of IBM’s Applied AI team, which built first-of-a-kind AI solutions for large and strategic clients across all industry verticals and problem spaces. He holds a degree in economics and finance from Duke University.

Presentations

Mass production of AI solutions Session

AI will change how we live in the next 30 years, but it's still currently limited to a small group of companies. In order to scale the impact of AI across the globe, we need to reduce the cost of building AI solutions, but how? Nate Keating explains how to apply lessons learned from other industries—specifically, the automobile industry, which went through a similar cycle.

Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns. His team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. He also developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Architecture and algorithms for end-to-end streaming data processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.

Model serving via Pulsar functions Session

Arun Kejariwal and Karthik Ramasamy walk you through an architecture in which models are served in real time and the models are updated, using Apache Pulsar, without restarting the application at hand. They then describe how to apply Pulsar functions to support two example use cases (sampling and filtering) and explore a concrete case study of the same.

Sequence-to-sequence modeling for time series Session

Sequence-to-sequence modeling (seq2seq) is now being used for applications based on time series data. Arun Kejariwal and Ira Cohen offer an overview of seq2seq and explore its early use cases. They then walk you through leveraging seq2seq modeling for these use cases, particularly with regard to real-time anomaly detection and forecasting.

Ivan Kelly is a software engineer at Streamlio, a startup dedicated to providing a next-generation integrated real-time stream processing solution, based on Heron, Apache Pulsar (incubating), and Apache BookKeeper. Ivan has been active in Apache BookKeeper since its very early days as a project in Yahoo! Research Barcelona. Specializing in replicated logging and transaction processing, he is currently focused on Streamlio’s storage layer.

Presentations

Architecture and algorithms for end-to-end streaming data processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.

Infinite retention using storage offloading with Apache Pulsar Session

Ivan Kelly discusses how Apache Pulsar provides infinite retention of events in topics. He explains how the segment-oriented architecture allows unlimited topic growth, how you can keep costs down by using tiered storage, and how you can run ad hoc queries on the topic using SQL.

Ganes Kesari is a cofounder and head of analytics at Gramener, where he leads analytics and innovation in data science, advising enterprises on deriving value from data science initiatives and leading applied research in deep learning at Gramener AI Labs. He’s passionate about the confluence of machine learning, information design, and data-driven business leadership and strives to simplify and demystify data science.

Presentations

AI for social good: Saving the planet through data science DCS

Global environmental challenges have pushed our planet to the brink of disaster. Rapid advances in deep learning are placing immense power in the hands of consumers and enterprises. Ganes Kesari explains how this power can be marshaled to support environmental groups and researchers who need immediate assistance to address the rapid depletion of our rich biodiversity.

Jay Kesavan is the head of the Analytics Practice at Bowery Analytics LLC, where he works with clients to devise predictive analytics strategies for executive decision makers. This involves advising companies on different modeling techniques, data transformation and visualization needs, and the software and human resources needed to execute analytics projects. He has spent the last 14 years working with Fortune 100 clients across industries executing large-scale transformation projects in CRM, order management, pricing engines, customer management systems, and advanced marketing solutions. He holds a BS in computer science from Andrews University, an MS in business analytics from NYU, and an MS in tech management from Columbia University as well as a certificate in leadership from IE, Madrid.

Presentations

Evaluating cybersecurity defenses with a data science approach Session

Cybersecurity analysts are under siege to keep pace with the ever-changing threat landscape. The analysts are overworked as they are bombarded with and burned out by the sheer number of alerts that they must carefully investigate. Brennan Lodge and Jay Kesavan explain how to use a data science model for alert evaluations to empower your cybersecurity analysts.

Nijma Khan is a principal at Faculty. She has 15 years of strategy experience. Over her career, she has focused on helping organisations combine commercial, social, and environmental value across the retail, consumer goods, and telecoms sectors, as well as working with organisations like the World Economic Forum and the United Nations.

Prior to joining Faculty, Nijma was responsible for Strategy, Insights and Innovation at Accenture, with a particular focus on the impact of automation on learning and work and the practical application of AI and emerging technologies for good.

Presentations

AI for managers 2-Day Training

Nijma Khan and Alberto Favaro offer a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

AI for managers (Day 2) Training Day 2

Nijma Khan and Alberto Favaro offer a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Seonmin Kim is a senior data risk analyst at LINE, where he’s a key member of the trust and safety team that handles payment fraud and content abuse using data analytics. He has over nine years of extensive experience identifying fraud and abuse risk across various business domains. His primary focus is on AI and machine learning for payment fraud and abuse risk.

Presentations

How to mitigate mobile fraud risk by data analytics Session

Seonmin Kim offers an introduction to mitigating the risk of mobile payments through data analytics, drawing on actual case studies of mobile fraud and covering tree-based machine learning, graph analytics, and statistical approaches.

Melinda King is a Google Authorized Trainer at ROI Training, 2017’s Google Cloud Training Partner of the Year. Melinda brings 30+ years of progressive experience with a unique combination of technical, managerial, and organizational skills. She has done solution design, development, and implementation using Google products including Compute Engine, App Engine, Kubernetes, Bigtable, Spanner, BigQuery, Pub/Sub, Dataflow, and Dataproc. Her expertise includes applying data science algorithms on big data to produce insights for optimizing business decisions. Melinda is also a Microsoft Certified Trainer with certifications for Azure, SQL Server, and Data Management and Analytics. Melinda spent 20+ years serving as a member of the US Marine Corps.

Presentations

Serverless machine learning with TensorFlow: Part I Tutorial

Melinda King offers an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and develop skills in developing, evaluating, and productionizing ML models.

Serverless machine learning with TensorFlow: Part II Tutorial

Melinda King offers an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and develop skills in developing, evaluating, and productionizing ML models.

Michael Kohs is a product manager at Cloudera.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses a number of challenges. Join Colm Moynihan, Jonathan Seidman, and Michael Kohs to explore cloud architecture and challenges and learn how to use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Mikayla Konst is a software engineer on the Cloud Dataproc team at Google. She helped launch Dataproc’s high availability mode and the Workflow Templates API. She’s currently working on improvements to shuffle and autoscaling.

Presentations

Improving Spark downscaling; Or, Not throwing away all of our work Session

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers, covering everything from how to schedule jobs to the location of blocks and their impact.

Gabor Kotalik is a big data project lead at Deutsche Telekom, where he’s responsible for continuous improvement of customer analytics and machine learning solutions for the commercial roaming business. He has more than 10 years of experience in business intelligence and advanced analytics focusing on using insights and enabling data-driven business decisions.

Presentations

Data science at Deutsche Telekom: Predicting global travel patterns and network demand Session

Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management.

Cassie Kozyrkov is Google Cloud’s chief decision scientist. Cassie is passionate about helping everyone make better decisions through harnessing the beauty and power of data. She speaks at conferences and meets with leadership teams to empower decision makers to transform their industries through AI, machine learning, and analytics. At Google, Cassie has advised more than a hundred teams on statistics and machine learning, working most closely with research and machine intelligence, Google Maps, and ads and commerce. She has also personally trained more than 15,000 Googlers (executives, engineers, scientists, and even nontechnical staff members) in machine learning, statistics, and data-driven decision making. Previously, Cassie spent a decade working as a data scientist and consultant. She’s a leading expert in decision science, with undergraduate studies in statistics and economics at the University of Chicago and graduate studies in statistics, neuroscience, and psychology at Duke University and NCSU. When she’s not working, you’re most likely to find Cassie at the theatre, in an art museum, exploring the world, playing board games, or curled up with a good novel.

Presentations

Making data science useful Keynote

Despite the rise of data engineering and data science functions in today's corporations, leaders report difficulty in extracting value from data. Many organizations aren’t aware that they have a blindspot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Join Cassie Kozyrkov to talk about how you can change that.

Semih Kumluk is a data scientist and human resources manager at Turkcell. He came to the company as part of the Enterprise Risk Management Department reporting to the board of directors. After receiving marketing, data science, and programming training provided by the company, he moved to the Human Resources Department to manage the employees who took these training courses with him—helping create value for the company while the company gets a return on its training investments. He’s also working on data science and AI projects of his own. With his diverse background and interest areas, Semih considers himself the “Da Vinci of the private sector.” At Unilever, he worked in the hair R&D segment; was a customer service executive in the Logistics Department responsible for ice cream; and was an innovation planning manager responsible for Unilever Food Solutions. He holds a master’s degree in engineering and technology management from Bogazici University and is working toward a PhD in finance in the Management Department at Yeditepe University. He also attended an executive education program in branding at the Kellogg School of Management. In his free time, he mentors university students, runs, swims, and competes in Iron Man races.

Presentations

Data transformation of Turkcell DCS

Semih Kumluk offers an overview of the data science community within Turkcell, which generates projects in all areas of the company, including HR, finance, and sales—all working toward transforming the company into a data-driven business.

Ben Lackey leads data and AI partnerships at Oracle Cloud Infrastructure. He’s part of a team working with leading ISV partners to ensure their products run well on Oracle’s cloud. Previously, Ben was on the ISV side at companies including Couchbase and DataStax, where he worked on their partnerships with cloud providers.

Presentations

Oracle's second-generation cloud: Optimized for the partner ecosystem (sponsored by Oracle Cloud Infrastructure) Session

Join Ben Lackey to learn how Oracle Cloud Infrastructure's architecture makes it the right place to run compute-intensive partner applications like H2O.ai, Cloudera, DataStax, and more.

Mounia Lalmas is a director of research and the head of tech research in personalization at Spotify. Her work focuses on studying user engagement in areas such as native advertising, digital media, social media, search, and music. Mounia also holds an honorary professorship at University College London. Previously, she was a director of research at Yahoo, where she led a team of researchers working on advertising quality for Gemini, Yahoo’s native advertising platform. She also worked with various teams at Yahoo on topics related to user engagement in the context of news, search, and user-generated content. She has given numerous talks and tutorials and is the coauthor of a book written as the outcome of her WWW 2013 tutorial on “measuring user engagement.”

Presentations

Recommending and searching at Spotify Session

Spotify's mission is "to match fans and artists in a personal and relevant way." Mounia Lalmas shares some of the (research) work the company is doing to achieve this, from using machine learning to metric validation, illustrated through examples within the context of home and search.

Claudia Regina Laselva is a Brazilian nurse with 30 years of experience and a strong background in implementing quality certifications and accreditations in hospitals. She was a pioneer in running patient flow management projects in Brazil, focused on improving hospital efficiency, optimizing processes and reducing wastefulness in care. Through her work, she demonstrates the importance of nursing to these issues and contributes to the sustainability of health systems. Claudia is a leader on strategies to improve safety and the patient experience in Brazil and Latin America. She was president of SOBRAGEN, the Brazilian Society of Management in Nursing, for two consecutive terms.

Presentations

Insightful health: Amplifying intelligence in healthcare patient flow execution Session

Fabio Ferraretto and Claudia Regina Laselva explain how Hospital Albert Einstein and Accenture evolved patient flow experience and efficiency with the use of applied AI, statistics, and combinatorial math, allowing the hospital to anticipate E2E visibility within patient flow operations, from admission of emergency and elective demands to assignment and medical releases.

Francesca Lazzeri is a senior machine learning scientist at Microsoft on the cloud advocacy team and an expert in big data technology innovations and the applications of machine learning-based solutions to real-world problems. Her research has spanned the areas of machine learning, statistical modeling, time series econometrics and forecasting, and a range of industries—energy, oil and gas, retail, aerospace, healthcare, and professional services. Previously, she was a research fellow in business economics at Harvard Business School, where she performed statistical and econometric analysis within the technology and operations management unit. At Harvard, she worked on multiple patent, publication and social network data-driven projects to investigate and measure the impact of external knowledge networks on companies’ competitiveness and innovation. Francesca periodically teaches applied analytics and machine learning classes at universities and research institutions around the world. She’s a data science mentor for PhD and postdoc students at the Massachusetts Institute of Technology and speaker at academic and industry conferences—where she shares her knowledge and passion for AI, machine learning, and coding.

Presentations

Cross-cloud model training and serving with Kubeflow Tutorial

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud.

Time series forecasting with Azure Machine Learning Tutorial

Time series modeling and forecasting is fundamentally important to various practical domains; in the past few decades, machine learning model-based forecasting has become very popular in both private and public decision-making processes. Francesca Lazzeri walks you through using Azure Machine Learning to build and deploy your time series forecasting models.

Sun Maria Lehmann is a leading engineer within the Enterprise Data Management Group at Equinor. Previously, she worked in data management at the Norwegian Hydrographic office and in drilling services at Statoil, including serving in advisory positions and as a member of the Blue Book Work Group and Diskos Well Committee. Sun holds an MSc in petroleum geoscience from NTNU.

Presentations

Architecting a data platform to support analytic workflows for scientific data Session

In upstream oil and gas, a vast amount of the data requested for analytics projects is scientific data: physical measurements about the real world. Historically, this data has been managed library style, but a new system was needed to best provide this data. Sun Maria Lehmann and Jane McConnell share architectural best practices learned from their work with subsurface data.

Implementing enterprise data management in industrial and scientific organizations Session

To succeed in implementing enterprise data management in industrial and scientific organizations and realize business value, the worlds of business data, facilities data, and scientific data—which have long been managed separately—must be brought together. Sun Maria Lehmann and Jane McConnell explore the cultural and organizational differences and the data management requirements to succeed.

Martin Leijen is the architect for the Data and Intelligence Lab, part of Rabobank’s Digital Transformation Office, where he is challenged to extend the current on-premises lab services to the cloud this year and is the privacy and security officer responsible for the data, AI, and analytics domain within the bank. From day one, alongside introducing innovation with new open source technology, Martin has kept the team highly aware of the privacy and ethics responsibilities that come with the team’s high privilege to process all data available within and outside the bank. He started his career in Rabobank’s IT Operations Department in the Netherlands but soon became interested in the value of data and BI and was involved with the bank’s first big data and advanced analytics experiments. This resulted in an assignment to help set up a new department, the Data Science and Business Consultancy. In his internal presentations and public speaking opportunities, he tries to motivate people to actively use their intrinsic privacy and ethics awareness in their daily job.

Presentations

There's something about data… Findata

Martin Leijen discusses how Rabobank created a data and intelligence lab as an enabler for data and business domains to accelerate their use of AI and advanced analytics.

Mauricio Lins is a Data & Analytics Technical Manager at everis NTT DATA UK, where he is currently responsible for AI and ML initiatives, coordinating the integration of solutions with other offices around Europe and America.

He has 15 years of experience in IT, mostly in application development and data platforms. For the last 5 years he has worked on big data implementations across a wide variety of architectures and business segments. He is also a developer certified by Cloudera as a Spark specialist.

He has published work on streaming and batch big data architectures at the CIACA conference in Portugal.

Presentations

The vindication of big data: How Santander UK uses Hadoop to defend privacy Session

Big data is usually regarded as a menace to data privacy. But with data privacy principles and a customer-first mindset, it can be a game changer. Maurício Lins and Lidia Crespo explain how Santander UK applied this model to comply with GDPR, using graph technology, Hadoop, Spark, and Kudu to drive data obscuring, data portability, and machine learning exploration.

Brennan Lodge is a data scientist at Goldman Sachs. A self-proclaimed data nerd, he’s been working in the financial industry for the past 10 years and is striving to save the world with a little help from our machine friends. He’s held cybersecurity, data scientist, and leadership roles at JPMorgan Chase, the Federal Reserve Bank of New York, Bloomberg, and Goldman Sachs. Brennan holds a master’s degree in business analytics from New York University and participates in the data science community with his nonprofit pro bono work at DataKind and as a co-organizer for the NYU Data Science and Analytics Meetup. Brennan is also an instructor at the New York Data Science Academy and teaches data science courses in R and Python.

Presentations

Evaluating cybersecurity defenses with a data science approach Session

Cybersecurity analysts are under siege to keep pace with the ever-changing threat landscape. The analysts are overworked as they are bombarded with and burned out by the sheer number of alerts that they must carefully investigate. Brennan Lodge and Jay Kesavan explain how to use a data science model for alert evaluations to empower your cybersecurity analysts.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience. He enjoys intelligent design and engaging storytelling and is passionate about data, music, and nature.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Julio López is a data scientist at Inditex, where he focuses on solving and improving AI and ML solutions. Previously, he was a researcher in high-performance computing and a software developer. He holds an interuniversity MAS in information technology and a master’s degree in mathematics from USC.

Presentations

Engineering ML to improve the shopping experience (sponsored by Zara Tech) Session

Julio López explains how Zara Tech uses indirect observation and ML engineering to augment the understanding of Zara's processes in order to improve them. Join in to learn how Zara Tech built and uses a Spark ML pipeline to provide KPIs to improve the shopping experience.

Ben Lorica is the chief data scientist at O’Reilly. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Sustaining machine learning in the enterprise Keynote

Keynote with Ben Lorica

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

David Low is the cofounder and chief data scientist at Pand.ai, a company building an AI-powered chatbot to disrupt and shape the booming conversational commerce space with deep natural language processing. He represented Singapore and the National University of Singapore (NUS) in the 2016 Data Science Games held in France, and clinched the top spot among Asian and American teams. David has been invited as a guest lecturer by NUS to conduct master classes on applied machine learning and deep learning topics. Throughout his career, David has engaged in data science projects across manufacturing, telco, ecommerce, and the insurance industry, including sales forecast modeling and influencer detection, which won him awards in several competitions and was featured on the IDA website and the NUS publication. Previously, he was a data scientist at the Infocomm Development Authority (IDA) of Singapore and was involved in research collaborations with Carnegie Mellon University (CMU) and Massachusetts Institute of Technology (MIT) on projects funded by the National Research Foundation and SMART. He competes on Kaggle and holds a top 0.2% worldwide ranking.

Presentations

The unreasonable effectiveness of transfer learning on NLP Session

Transfer learning has been proven to be a tremendous success in computer vision—a result of the ImageNet competition. In the past few months, there have been several breakthroughs in natural language processing with transfer learning, namely ELMo, OpenAI Transformer, and ULMFit. David Low demonstrates how to use transfer learning on an NLP application with SOTA accuracy.

Feng Lu is a software engineer at Google and the tech lead and manager for Cloud Composer. Feng has a broad interest in cloud and big data analytics. He holds a PhD from UC San Diego, where his research work was reported on by MIT Technology Review among others.

Presentations

Migrating Apache Oozie workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution.

Boris Lublinsky is a principal architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on machine learning with Kafka-based streaming pipelines Tutorial

Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.

Swetha Machanavajhala is a software engineer for Azure Networking at Microsoft, where she builds tools to help engineers detect and diagnose network issues within seconds. She is very passionate about building products and awareness for people with disabilities and has led several related projects at hackathons, driving them from idea to reality to launching as a beta product and winning multiple awards. Swetha is a co-lead of the Disability Employee Resource Group, where she represents the community of people who are deaf or hard of hearing, and is a part of the ERG chair committee. She is also a frequent speaker at both internal and external events.

Presentations

Inclusive design: Deep learning on audio in Azure, identifying sounds in real time Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in this world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.

Mark Madsen is a Fellow at Teradata, where he’s responsible for understanding, forecasting, and defining analytics ecosystems and architectures. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning, and vendors on product management. Mark has designed analysis, machine learning, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Romi Mahajan is CCO at leading AI company Quantarium and CEO at KKM Group, an advisory and investment firm with interests in 35 technology-related companies. A marketer, author, activist, and philanthropist, Romi spent a decade at Microsoft and has been CMO of five companies.

Presentations

Real estate, real AI: Insights and decisions in the world's largest asset class Findata

Residential real estate is the world's largest asset class, and "dwellings" constitute the single largest purchase for most families around the globe. Still, in the world's largest residential real estate markets, the process of valuing, buying, and selling houses is byzantine, analog, and mysterious. Romi Mahajan explains why sophisticated and real-world AI is the key to democratizing value.

Manish Maheshwari is a data architect and data scientist at Cloudera. Manish has 13+ years of experience building extremely large data warehouses and analytical solutions. He’s worked extensively on Hadoop, DI and BI tools, data mining and forecasting, data modeling, master and metadata management, and dashboard tools and is proficient in Hadoop, SAS, R, Informatica, Teradata, and Qlikview. He participates in Kaggle Data Mining competitions as a hobby.

Presentations

Scaling Impala: Common mistakes and best practices Session

Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala is able to handle hundreds of nodes and tens of thousands of queries hourly. Manish Maheshwari explains how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and anti-patterns for end users or BI applications.

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

Mastering streaming and pipelines: Designing and supporting the nervous system of your company Session

The world of data is all about building the best path to support time and quality to value. 80% to 90% of the work is getting the data into the hands and tools that can create value. Ted Malaska takes you on a journey to investigate strategies and designs that can change the way your company looks at and approaches data.

Sundeep Reddy Mallu is senior vice president of product development at data science company Gramener, where he leads a team of data enthusiasts who tell visual stories of insights from analysis, built on Gramex, Gramener’s data science-in-a-box platform. Previously, Sundeep worked at Comcast Cable, NeoTech Solutions, Birlasoft, and Wipro Technologies and was a consultant for federal agencies in the US and India. He holds a bachelor’s degree in electrical engineering and an MBA in IT and marketing.

Presentations

India's data dilemma with India Stack Session

Answering the simple question of what rights Indian citizens have over their data is a nightmare. The rollout of India Stack technology-based solutions has added fuel to the fire. Sundeep Reddy Mallu explains, with on-the-ground examples, how businesses and citizens in India's booming digital economy are navigating the India Stack ecosystem while dealing with data privacy, security, and ethics.

James Malone is a product manager for Google Cloud Platform and manages Cloud Dataproc and Apache Beam (incubating). Previously, James worked at Disney and Amazon. He’s a big fan of open source software because it shows what’s possible when people come together to solve common problems with technology. He also loves data, amateur radio, Disneyland, photography, running, and Legos.

Presentations

Migrating Apache Oozie workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution.

David Maman is founder, CEO, and CTO at Binah. A serial entrepreneur, David founded HexaTier/GreenSQL (acquired by Huawei), Precos, Vanadium-soft, GreenCloud, Teridion, Terrasic, and ReSec, among others. Previously, he was a director in Fortinet’s CTO office, where he managed information security at the Israeli telecom Bezeq. He has 24 years’ experience in leadership, AI, cybersecurity, development, and networking and is a veteran of an elite Israel Defense Forces (IDF) unit. He was named one of the top 40 Israeli internet startup professionals by TheMarker Magazine and one of the top 40 under 40 most promising Israeli business professionals by Globes magazine. David holds a master’s degree in computer science from Open University.

Presentations

Signal processing, machine learning, and video tell the truth Session

David Maman demonstrates how the combination of a mere few minutes of video, signal processing, remote heart-rate monitoring, machine learning, and data science can identify a person’s emotions, health condition, and performance. Financial institutions and potential employers can now analyze whether you have good or bad intentions.

Shingai Manjengwa is the chief executive officer at Fireside Analytics Inc., a Canadian ed-tech startup that offers customized cloud-hosted data science training and consulting services to corporations, governments, and educational institutions. Fireside Analytics’s data science courses have over 300,000 registered learners on platforms like IBM’s CognitiveClass.ai and Coursera. An IBM Influencer, author, and NYU Stern alumna, Shingai is also the founder of Fireside Analytics Academy, a registered private high school that teaches high school students to solve problems with data. The IDC4U, High School Data Science program (inspected by the Ministry of Education in Canada) uses real-life youth-focused case studies to combine statistics, mathematics, business, and computer programming: the pillars of data science. The program is completely online; international students are welcome. The curriculum is also offered in three private high schools in Ontario, Canada.

Presentations

Building data science capacity in your organization Keynote

Shingai Manjengwa shares insights from teaching data science to 300,000 online learners, second-career college graduates, and grade 12/6th form high school students, explaining how business leaders can increase data science skill sets across different levels and functions in an organization to create real and measurable value from data.

Cecilia Marchi is a manager at Jakala, where she helps retail and FMCG companies create a sustainable competitive advantage and increase their top line by leveraging data, advanced analytics and AI, location analytics, technologies, and experience design. She has more than eight years of experience in consulting and martech.

Presentations

Data-intense profiling of points of consumption to increase sales and marketing effectiveness DCS

Cecilia Marchi shares a case history of a major beverage company that wanted to get a detailed picture of the French out-of-home market. The combination of unusual data sources, technology assets, geographical analysis, and advanced analytics led to the deployment of a tool now supporting both sales and marketing teams in understanding the market and optimizing activities and investments.

Jane McConnell is a practice partner for oil and gas within Teradata’s Industrial IoT Group, where she shows oil and gas clients how analytics can provide strategic advantage and business benefits in the multimillions. Jane is also a member of Teradata’s IoT core team, where she sets the strategy and positioning for Teradata’s IoT offerings and works closely with Teradata Labs to influence development of products and services for the industrial space. Originally from an IT background, Jane has also done time with dominant market players such as Landmark and Schlumberger in R&D, product management, consulting, and sales. In one role or another, she has influenced information management projects for most major oil companies across Europe. She chaired the education committee for the European oil industry data management group ECIM, has written for Forbes, and regularly presents internationally at oil industry events. Jane holds a BEng in information systems engineering from Heriot-Watt University in the UK. She is Scottish and has a stereotypical love of single malt whisky.

Presentations

Architecting a data platform to support analytic workflows for scientific data Session

In upstream oil and gas, a vast amount of the data requested for analytics projects is scientific data: physical measurements about the real world. Historically, this data has been managed library style, but a new system was needed to best provide this data. Sun Maria Lehmann and Jane McConnell share architectural best practices learned from their work with subsurface data.

Implementing enterprise data management in industrial and scientific organizations Session

To succeed in implementing enterprise data management in industrial and scientific organizations and realize business value, the worlds of business data, facilities data, and scientific data—which have long been managed separately—must be brought together. Sun Maria Lehmann and Jane McConnell explore the cultural and organizational differences and the data management requirements to succeed.

Darragh McConville is a solution architect specializing in data engineering at Kainos, where he’s Kainos’s lead architect for NewDay’s AWS data platform. He’s been working with data-intensive systems for over a decade and was the founder of Kainos’s data and analytics capability. He enjoys working with talented people and, like every engineer, loves a technical challenge. In his spare time, he’s usually up a mountain or in a squash court but lately has developed an unhealthy fascination with unsolved crimes.

Presentations

Transforming a financial services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up on AWS Session

Eoin O'Flanagan and Darragh McConville explain how NewDay built a high-performance contemporary data processing platform from the ground up on AWS. Join in to explore the company's journey from a traditional legacy onsite data estate to an entirely cloud-based PCI DSS-compliant platform.

Michael McCune is a software developer in Red Hat’s Emerging Technology Group, where he develops and deploys applications for cloud platforms. He’s an active contributor to several radanalytics.io projects and a core reviewer for the OpenStack API Working Group. Previously, Michael developed Linux-based software for embedded global positioning systems.

Presentations

Application intelligence: Bridging the gap between human expertise and machine learning Session

Artificial intelligence and machine learning are now popularly used terms, but how do you make use of these techniques without throwing away the valuable knowledge of experienced employees? Rebecca Simmonds and Michael McCune delve into this idea with examples of how distributed machine learning frameworks fit together naturally with business rules management systems.

Alket Memushaj is a solutions architect at AWS.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Simona Meriam is a big data engineer at Nielsen Marketing Cloud, where she specializes in research and development of solutions for big data infrastructures using cutting-edge technologies such as Spark, Kafka, and Elasticsearch.

Presentations

Nielsen presents: Fun with Kafka, Spark, and offset management Session

Simona Meriam explains how Nielsen Marketing Cloud (NMC) used to manage its Kafka consumer offsets with the Spark-Kafka 0.8 consumer and why the company decided to upgrade from the Spark-Kafka 0.8 consumer to the 0.10 consumer. Simona reviews the problems encountered during the upgrade and details the process that led to the solution.

Michele Miraglia is a manager at Data Reply. Michele coordinates the center of expertise that deals with big data and machine learning to support sales and marketing functions. His team offers innovative data management and analysis solutions based on the most modern machine learning and artificial intelligence techniques in order to extract the greatest possible business value from them.

Presentations

How retailers can leverage data to stay competitive in an ever-changing digital landscape (sponsored by Data Reply) Session

Retailers are facing a daunting challenge: remaining competitive in an ever-changing landscape that is becoming increasingly digital—which requires them to overcome rifts in internal systems and seamlessly leverage their data to generate business value. Luca Piccolo and Michele Miraglia outline Data Reply's approach, distilled while supporting retailers in successfully tackling these challenges.

Cameron Moberg is a senior computer science student at Truman State University in Missouri and a research intern on the Cloud Composer team at Google. Previously, he held two other internships at Google. Cameron has a passion for open source projects, with a recent interest in Apache Airflow and Apache Oozie.

Presentations

Migrating Apache Oozie workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution.

Robin Moffatt is a Developer Advocate at Confluent, the company founded by the creators of Apache Kafka, as well as an Oracle Developer Champion and ACE Director Alumnus. His career has always involved data, from the old worlds of COBOL and DB2, through the worlds of Oracle and Hadoop, and into the current world with Kafka. His particular interests are analytics, systems architecture, performance testing, and optimization. He blogs at http://cnfl.io/rmoff and http://rmoff.net/ (and previously http://ritt.md/rmoff) and can be found tweeting grumpy geek thoughts as @rmoff. Outside of work he enjoys drinking good beer and eating fried breakfasts, although generally not at the same time.

Presentations

Real-time SQL stream processing at scale with Apache Kafka and KSQL Tutorial

Robin Moffatt walks you through the architectural reasoning for Apache Kafka and the benefits of real-time integration. You'll then build a streaming data pipeline using nothing but your bare hands, Kafka Connect, and KSQL.

The changing face of ETL: Event-driven architectures for data engineers Session

Robin Moffatt discusses the concepts of events, their relevance to software and data engineers, and their ability to unify architectures in a powerful way. Join in to learn why analytics, data integration, and ETL fit naturally into a streaming world. Along the way, Robin will lead a hands-on demonstration of these concepts in practice and commentary on the design choices made.

Simon Moritz is an IoT ecosystem evangelist at Ericsson. A passionate ex–data scientist who moved over to the business side a few years back due to a lack of understanding about the importance of a data-driven business, Simon has since created new data-driven offerings, acting as the lead architect behind Sweden’s Strategic Innovation Program related to transportation, Drive Sweden. Drive Sweden consists of more than 90 global partners in the area of transportation with the purpose of setting a new de facto standard way of working and a digital infrastructure worthy of the Fourth Industrial Revolution.

Presentations

The digital truth and the physical twin DCS

The truth is no longer what you see with your eyes; the truth is in the digital sphere, where it only sometimes needs a physical twin. After all, what's the need for a road sign along the street if the information is already in the car? Simon Moritz details how the Fourth Industrial Revolution is transforming companies and business models as we know them.

Colm Moynihan is partner presales manager in EMEA for Cloudera, where he helps system integrators, ISVs, hardware, cloud partners, resellers, and distributors drive digital transformation into joint customers. Previously, Colm was director of presales in EMEA at Informatica, working with resellers, OEMs, and GSIs to integrate, master, and cleanse customers’ enterprise data. Colm has over 25 years’ experience in development, consulting, finance and banking, startups, and large multinational software companies. Colm holds a master’s degree in distributed computing from Trinity College Dublin.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses a number of challenges. Join Colm Moynihan, Jonathan Seidman, and Michael Kohs to explore cloud architecture and challenges and learn how to use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Francesco Mucio is a BI architect at Zalando. The first time Francesco met the word data, it was just the plural of datum. Now he’s helping to redraw Zalando’s data architecture. He likes to draw data models and optimize queries. He spends his free time with his daughter, who, for some reason, speaks four languages.

Presentations

From BI to big data; Or, There and back again Session

Francesco Mucio shares the basic tools he and his team had to learn (or relearn) moving from the coziness of their database to the big world of Spark, cloud, distributed systems, and continuous applications. It was an unexpected journey that ended exactly where it started: with an SQL query.

Constantin Muraru is an engineer at Adobe. In his time at the company, he has worked on various Adobe Marketing Cloud solutions, where he got to experiment with mobile, video, and backend development. His team currently focuses on offering infrastructure automation and fast deployments in Adobe Audience Manager. Outside business hours, he loves playing pool and enjoys a good book.

Presentations

Deploying your real-time apps on thousands of servers and still being able to breathe Session

With the current crop of cloud providers, obtaining servers to run your real-time application has never been easier. But what happens when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast, reliable way, with minimal human intervention? Constantin Muraru and Dan Popescu tell you how to tackle this challenge.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Running SQL-based workloads in the cloud at 20x–200x lower cost using Apache Arrow Session

Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. Jacques Nadeau explains how to accelerate TPC workloads, invisible to client apps, and how to use Apache Arrow, Parquet, and Calcite to provide a scalable, high-performance solution optimized for cloud deployments while significantly reducing operational costs.

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism at Databricks and Apache Spark. Paco is the cochair of Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.

Presentations

Data Case Studies welcome Tutorial

Paco Nathan welcomes participants to the Data Case Studies tutorial.

Executive Briefing: Overview of data governance Session

Effective data governance is foundational for AI adoption in enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, process, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa.

Sami Niemi is a vice president and managing data scientist at Barclays, where he leads a team of data scientists building fraud detection models and manages the UK fraud models. Sami has been working on Bayesian inference and machine learning for over 10 years and has published peer-reviewed papers in astrophysics and statistics. He has delivered machine learning models for telecommunications and financial services and built supervised learning models to predict customer and company defaults, first- and third-party fraud, and customer complaints, and used natural language processing for probabilistic parsing and matching. He has also used unsupervised learning in a risk-based anti-money-laundering application.

Presentations

Predicting real-time transaction fraud using supervised learning Session

Predicting transaction fraud in debit and credit card payments in real time is an important challenge, which state-of-the-art supervised machine learning models can help to solve. Sami Niemi offers an overview of the solutions Barclays has been developing and testing and details how well the models perform in a variety of situations, such as card-present and card-not-present debit and credit card transactions.

Jack Norris is the senior vice president of data and applications at MapR Technologies, where he works with leading customers and partners worldwide to drive the understanding and adoption of new applications enabled by data and analytics. With over 25 years of enterprise software experience, he has demonstrated success from identifying new markets to defining new products to launching companies. Jack’s background includes senior executive positions with established analytic, virtualization, and storage companies. Jack was an early employee of MapR Technologies and held senior executive roles with EMC, Brio Technology, and Bain and Company.

Presentations

Executive Briefing: The hidden data scientists lurking in your company Session

Many companies delay addressing core improvements in increasing revenues, reducing costs and risk exposure by tying changes to a to-be-hired data scientist. Drawing on three customer examples, Jack Norris explains how to achieve excellent results faster by starting with domain experience and helping developers and analysts better leverage data with available and understandable analytics.

Kris Nova is a senior developer advocate at Heptio focusing on containers, infrastructure, and Kubernetes. She is also an ambassador for the Cloud Native Computing Foundation. Previously, Kris was a developer advocate and an engineer on Kubernetes in Azure at Microsoft. She has a deep technical background in the Go programming language and has authored many successful tools in Go. Kris is a Kubernetes maintainer and the creator of kubicorn, a successful Kubernetes infrastructure management tool. She organizes a special interest group in Kubernetes and is a leader in the community. Kris understands the grievances with running cloud native infrastructure via a distributed cloud native application and recently authored an O’Reilly book on the topic: Cloud Native Infrastructure. Kris lives in Seattle, WA, and spends her free time mountaineering.

Presentations

Autoscaling Spark on Kubernetes Session

In the Kubernetes world, where declarative resources are a first-class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice, we can finally look at constructing a hybrid system of running Spark in a distributed cloud native way. Join respective experts Kris Nova and Holden Karau for a fun adventure.

Jan is an eFX quant trader at Deutsche Bank. Previously, he was a front office quant at HSBC, working in the electronic FX markets. Before joining the HSBC team, he worked at the Centre for Econometric Analysis on high-frequency time series econometric models and was a visiting lecturer at Cass Business School, Warwick Business School, Politecnico di Milano, and New York University in Prague. He has coauthored a number of papers in peer-reviewed journals in finance and physics, contributed to several books, and presented at numerous conferences and workshops all over the world. During his PhD studies, he cofounded Quantum Finance CZ. He is a machine learning enthusiast and explores kdb+/q for this purpose.

Presentations

Digital transformation of the trading floor Findata

Machine learning and AI are an inevitable part of any workflow that deals with data. In this talk, I will review several machine learning topics that contribute to the digitalisation of the trading floor and show how the role of quants is changing in the machine learning era.

Eoin O’Flanagan is a lead data engineer at NewDay, where, for the past couple of years, he has worked as part of NewDay’s digital transformation, specifically in bringing in and enabling new data capabilities. He previously worked at data analytics firm Dunnhumby, where he held roles across data, IT, and architecture.

Presentations

Transforming a financial services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up on AWS Session

Eoin O'Flanagan and Darragh McConville explain how NewDay built a high-performance contemporary data processing platform from the ground up on AWS. Join in to explore the company's journey from a traditional legacy onsite data estate to an entirely cloud-based PCI DSS-compliant platform.

Brian O’Neill is the founder and consulting product designer at Designing for Analytics, where he focuses on helping companies design indispensable data products that customers love. Brian’s clients and past employers include Dell EMC, NetApp, TripAdvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JPMorgan Chase, the Future of Music Coalition, and E*TRADE, among others; over his career, he has worked on award-winning storage industry software for Akorri and Infinio. Brian has been designing useful, usable, and beautiful products for the web since 1996. Brian has also brought over 20 years of design experience to various podcasts, meetups, and conferences such as the O’Reilly Strata Conference in New York City and London, England. He is the author of the Designing for Analytics Self-Assessment Guide for Non-Designers as well as numerous articles on design strategy, user experience, and business related to analytics. Brian is also an expert advisor on the topics of design and user experience for the International Institute for Analytics. When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage performing as a professional percussionist and drummer. He leads the acclaimed dual-ensemble Mr. Ho’s Orchestrotica, which the Washington Post called “anything but straightforward,” and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival. If you’re at a conference, just look for the only guy with a stylish orange leather messenger bag.

Presentations

Empathy: The secret ingredient in the design of engaging data products and analytics tools Session

Brian O'Neill explains how design is fundamentally improving the bottom line of business and can help data teams uncover the real problems and needs of customers and business stakeholders. Join in to learn and practice a key aspect of good design: how to properly interview stakeholders and users.

Cait O’Riordan is the Financial Times’ (FT) chief product and information officer (CPIO). She’s responsible for platform and product strategy, development, and operations across the FT Group, working in close partnership with editorial and commercial teams. She’s on the FT executive board, which is responsible for the company’s global strategy and performance. Previously, Cait led the BBC’s digital product development for the London 2012 Olympics and played a central role in the user and revenue growth of music app company Shazam.

Presentations

Finding your North Star Keynote

The Financial Times hit its target of 1 million paying subscribers a year ahead of schedule. Cait O'Riordan discusses the North Star metric the company uses to drive subscriber growth, detailing how it's embedded across the organization and within the engineering and product teams she's responsible for.

Larry Orimoloye is a solutions architect at Dataiku. He’s interested in driving tangible business value by combining advanced analytics using structured and unstructured data across all industries and enjoys bridging the gap between academic research and industry. He helps clients deliver ROI utilizing a business-led, technology-enabled approach to analytics; in particular, he has helped clients establish centers of excellence with an analytics remit across the organization and designed and implemented customer-centric real-time decision platforms using a combination of statistics, big data, and machine learning techniques. He holds a master’s degree in applied statistics and data mining from the University of St. Andrews.

Presentations

Augment your recommender system with transfer learning on images (sponsored by Dataiku) Session

Recommender systems are tools that provide suggestions that best suit the customers' needs, even when they're not aware of it. Larry Orimoloye explains how Dataiku helped one of the world's leading vacation retailers drive customers toward better recommendations.

Jerry Overton is a data scientist and fellow in the Analytics Group and the global lead for artificial intelligence at DXC. Jerry is the author of Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist from O’Reilly and teaches the live online training course Mastering Data Science at Enterprise Scale: How to Design and Implement Machine Learning Solutions That Improve Your Organization. In his blog, Doing Data Science, Jerry shares his experiences leading open research and transforming organizations using data science.

Presentations

How to keep ethical with machine learning Session

Machine learning (ML) algorithms are good at learning new behaviors but bad at identifying when those behaviors are harmful or don’t make sense. Bias, ethics, and fairness are big risk factors in ML. However, we creators have a lot of experience dealing with intelligent beings—one another. Jerry Overton uses this common sense to build a checklist for protecting against ethical violations with ML.

Laila Paszti is a lawyer practicing technology and privacy law at GTC Law Professional Corp. She’s also a software applications engineer. She previously held positions at ExxonMobil and Capstone Technology, where she designed and implemented machine learning (AI) software solutions to optimize industrial processes. She routinely advises both Fortune 100 and startup clients on all aspects of the development and commercialization of their technology solutions (including big data, predictive modeling, and machine learning) in diverse industries including fintech, healthcare, and the automotive industry. She’s a steering committee member of the Toronto Machine Learning Symposium and will be a panel member discussing responsible AI innovation in November. She has spoken most recently at the Global Blockchain Conference, the Healthcare Blockchain in Canada conference, and the Linux FinTech Forum. Laila will be a faculty member for the upcoming Osgoode Certificate in Blockchains, Smart Contracts, and the Law (November 2018). She holds a BASc in chemical engineering from the University of Toronto, an MASc in chemical engineering from the University of Waterloo, and a JD from the University of Toronto, where she was a law review editor. She is admitted to practice in New York and Ontario. She’s also a Certified Information Privacy Professional (Canada) (CIPP/C).

Presentations

Responsible AI innovation Session

As companies commercialize novel applications of AI in areas such as finance, hiring, and public policy, there's concern that these automated decision-making systems may unconsciously duplicate social biases, with unintended societal consequences. Laila Paszti shares practical advice for companies to counteract such prejudices through a legal- and ethics-based approach to innovation.

Yves Peirsman is the founder and natural language processing expert at NLP Town. Yves started his career as a PhD student at the University of Leuven and a postdoctoral researcher at Stanford University. Since he made the move from academia to industry, he has gained extensive experience in consultancy and software development for NLP projects in Belgium and abroad.

Presentations

Dealing with data scarcity in natural language processing Session

In this age of big data, NLP professionals are all too often faced with a lack of data: written language is abundant, but labeled text is much harder to come by. Yves Peirsman outlines the most effective ways of addressing this challenge, from the semiautomatic construction of labeled training data to transfer learning approaches that reduce the need for labeled training examples.

Nick Pentreath is a principal engineer at the Center for Open Source Data & AI Technologies (CODAIT) at IBM, where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations, and was at Goldman Sachs, Cognitive Match, and Mxit. He’s a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Building a secure and transparent ML pipeline using open source technologies Session

The application of AI algorithms in domains such as criminal justice, credit scoring, and hiring holds unlimited promise. At the same time, it raises legitimate concerns about algorithmic fairness. There's a growing demand for fairness, accountability, and transparency from machine learning (ML) systems. Nick Pentreath explains how to build just such a pipeline leveraging open source tools.

Dirk Petzoldt is a head of engineering and data science at Zalando, Europe’s leading fashion platform. Trained as a data scientist, he enables his five development teams to revolutionize online marketing steering in a fully automated, ROI-driven, personalized way. In his spare time, Dirk is hacking functional Scala and reading through O’Reilly’s online library, 10 books at a time.

Presentations

Insights from engineering Europe's largest marketing platform for fashion Session

Dirk Petzoldt shares a case study from Europe’s leading online fashion platform Zalando illustrating its journey to a scalable, personalized machine learning–based marketing platform.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, he was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit file system, and an early employee at VMware, where he was a senior staff engineer and a key member of the ESX storage architecture team. At VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular pluggable storage architecture and led teams working on many key storage initiatives such as the cloud storage gateway and vFlash.

Presentations

Deep learning with TensorFlow and Spark using GPUs and Docker containers Session

Organizations need to keep ahead of their competition by using the latest AI, ML, and DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. Thomas Phelan discusses the effective deployment of such applications in a container environment.

Luca Piccolo is a manager at Data Reply, where he specializes in big data analytics and data science. This breadth of experience allows him to easily bridge the gap between business, data modeling, and technology. Drawing on his international experience across different industries, he supports business stakeholders in identifying relevant value cases and then leads medium and large groups toward shared goals, abstracting the delivery complexity for his customers.

Presentations

How retailers can leverage data to stay competitive in an ever-changing digital landscape (sponsored by Data Reply) Session

Retailers are facing a daunting challenge: remaining competitive in an ever-changing landscape that is becoming increasingly digital—which requires them to overcome rifts in internal systems and seamlessly leverage their data to generate business value. Luca Piccolo and Michele Miraglia outline Data Reply's approach, distilled while supporting retailers in successfully tackling these challenges.

Willem Pienaar leads the data science platform team at GOJEK, working on the GOJEK ML platform, which supports a wide variety of models and handles over 100 million orders every month. His main focus areas are building data and ML platforms, allowing organizations to scale machine learning and drive decision making. In a previous life, Willem founded and sold a networking startup and was a software engineer in industrial control systems.

Presentations

Unlocking insights in AI by building a feature store Session

Features are key to driving impact with AI at all scales, allowing organizations to dramatically accelerate innovation and time to market. Willem Pienaar and Zhiling Chen explain how GOJEK, Indonesia's first billion-dollar startup, unlocked insights in AI by building a feature store called Feast, and the lessons they learned along the way.

Dan Popescu is a site reliability engineer on the Adobe Audience Manager team at Adobe, where he’s currently focused on creating and deploying continuous delivery pipelines for applications within the project, dealing with all aspects of the automation process from instance provisioning to application deployments. Dan is passionate about technology and, more recently, about programming in general. He also loves playing video games.

Presentations

Deploying your real-time apps on thousands of servers and still being able to breathe Session

With the current crop of cloud providers, obtaining servers to run your real-time application has never been easier. But what happens when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast, reliable way, with minimal human intervention? Constantin Muraru and Dan Popescu tell you how to tackle this challenge.

Phillip Radley is chief data architect on the core enterprise architecture team at BT, where he’s responsible for data architecture across the company. Based at BT’s Adastral Park campus in the UK, Phill leads BT’s MDM and big data initiatives, driving associated strategic architecture and investment road maps for the business. He’s worked in IT and communications for 30 years. Previously, Phill was chief architect for infrastructure performance-management solutions from UK consumer broadband to outsourced Fortune 500 networks and high-performance trading networks. He has broad global experience, including with BT’s Concert global venture in the US and five years as an Asia Pacific BSS/OSS architect based in Sydney. Phill is a physics graduate with an MBA.

Presentations

Information architecture for an enterprise data cloud Session

It's now possible to build a modern data platform capable of storing, processing, and analyzing a wide variety of data across multiple public and private cloud platforms and on-premises data centers. Mark Samson and Phillip Radley outline an information architecture for such a platform, informed by working with multiple large organizations that have built such platforms over the last five years.

Greg Rahn is director of product management at Cloudera, where he’s responsible for driving SQL product strategy as part of the company’s data warehouse product team, including working directly with Impala. For over 20 years, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently product management, providing a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

The future of cloud native data warehousing: Emerging trends and technologies Session

Data warehouses have traditionally run in the data center, and in recent years, they've been adapted to be more cloud native. Greg Rahn discusses a number of emerging trends and technologies that will impact how data warehouses are run both in the cloud and on-premises and explains what that means for architects, administrators, and end users.

Vidya Raman leads product management for machine learning at Cloudera. Previously, she helped build highly successful software portfolios in several industry verticals, including telecom, healthcare, energy, and the IoT. Her experience spans early-stage startups, pre-IPO companies, and big enterprises. Vidya holds a master’s in business administration from Duke University.

Presentations

Starting with the end in mind: Lessons learned from data strategies that work Session

Not surprisingly, there's no single approach to embracing data-driven innovations within any industry vertical. However, some enterprises are doing a better job than others when it comes to establishing a culture, process, and infrastructure that lend themselves to data-driven innovations. Vidya Raman explores some key foundational ingredients that span multiple industries.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Architecture and algorithms for end-to-end streaming data processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.

Infinite retention using storage offloading with Apache Pulsar Session

This talk discusses how Apache Pulsar provides infinite retention of events in topics. We will discuss how the segment-oriented architecture allows unlimited topic growth, how you can keep costs down by using tiered storage, and how you can run ad hoc queries on the topic using SQL.

Model serving via Pulsar functions Session

Arun Kejariwal and Karthik Ramasamy walk you through an architecture in which models are served in real time and the models are updated, using Apache Pulsar, without restarting the application at hand. They then describe how to apply Pulsar functions to support two example use cases (sampling and filtering) and explore a concrete case study of the same.

Marc Rind is chief data scientist and vice president of product development at ADP, where he’s responsible for leading the research and development of the company’s analytics and big data initiative and driving innovation and thought leadership in building ADP’s Client Analytics platform. Marc was also an instrumental leader behind the small business market payroll platform RUN Powered by ADP and led a number of the technology teams responsible for delivering its critically acclaimed product focused on its innovative user experience for small business owners. Marc’s innovative spirit and fascination with data were forged at Bolt Media, a dot-com startup based in NYC’s Silicon Alley. The company was an early predecessor to today’s social media outlets. As an early data scientist, Marc focused on the patterns and predictions of site usage through the harnessing of the data on its 10+ million user profiles.

Presentations

The power of merging multifunctional expertise to create innovative, data-driven products DCS

Marc Rind shares his experience creating a cross-functional team, discusses the power of listening to others’ points of view (and what you can learn from them), and explores real-world case studies of leaders with varying backgrounds and perspectives who collaborated to take data from analysis to idea to product rollout.

Duncan Ross is chief data officer at Times Higher Education. Duncan has been a data miner since the mid-1990s. Previously at Teradata, Duncan created analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing and social network analysis in telecommunications. In his spare time, Duncan has been a city councilor, chair of a national charity, founder of an award-winning farmers market, and one of the founding directors of the Institute of Data Miners. More recently, he cofounded DataKind UK and regularly speaks on data science and social good.

Presentations

Using data for evil V: The AI strikes back Session

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis (and possibly AI). Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

Why is it so hard to do AI for good? Session

DataKind UK has been working in data for good since 2013, helping over 100 UK charities to do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. Duncan Ross and Giselle Cory explain how to identify the right data for good projects and how this can act as a framework for avoiding the same problems across industry.

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Executive Briefing: AWS technology trends—Data lakes and analytics Session

Nikki Rouda shares key trends in data lakes and analytics and explains how they shape the services offered by AWS. Specific topics include the rise of machine-generated data and semistructured and unstructured data as dominant sources of new data, the move toward serverless, API-centric computing, and the growing need for local access to data from users around the world.

Executive Briefing: Big data in the era of heavy worldwide privacy regulations Session

The implications of new privacy regulations for data management and analytics, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance.

S.P.T. Krishnan is a computer scientist and engineer with 18+ years of professional research and development experience in cloud computing, big data analytics, machine learning, and computer security. He’s a Google Developer Expert in Google Cloud Platform and an authorized trainer for Google Cloud Platform. He’s also an adjunct faculty member in computer science and has taught 500+ university students in the past five years. He has worked as an architect and developer on Amazon Web Services, Google Cloud Platform, OpenStack, and Microsoft Azure. He authored Building Your Next Big Thing with Google Cloud Platform and has spoken at both Black Hat and RSA. He’s also a cofounder of the Google Developer Group, Singapore. Red Hat recently recognized him as the “Red Hat Certified Engineer of the Year.” He holds a PhD in computer engineering from the National University of Singapore, where he studied the performance characteristics of high-performance computing algorithms by evaluating them on different multiprocessor architectures.

Presentations

Using AWS serverless technologies to analyze large datasets Tutorial

Krishnan Saidapet offers an overview of the latest big data and machine learning serverless technologies from Amazon Web Services (AWS) and leads a deep dive into using them to process and analyze two different datasets: the publicly available Bureau of Labor Statistics dataset and the Chest X-Ray Image Data dataset.

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. Previously, he was at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka.

Presentations

How do you evolve your data infrastructure? Session

Developing data infrastructure is not trivial; neither is changing it. It takes effort and discipline to make changes that can affect your team. Neelesh Salian discusses how Stitch Fix's data platform team maintains and innovates its infrastructure for the company's data scientists.

Shioulin Sam is a research engineer at Cloudera Fast Forward Labs, where she bridges academic research in machine learning with industrial applications. Previously, she managed a portfolio of early stage ventures focusing on women-led startups and public market investments and worked in the investment management industry designing quantitative trading strategies. She holds a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Presentations

Learning with limited labeled data Session

Supervised machine learning requires large labeled datasets—a prohibitive limitation in many real-world applications. What if machines could learn with fewer labeled examples? Shioulin Sam shares an algorithmic solution that relies on collaboration between humans and machines to label smartly and discusses product possibilities.

Manos Samatas is a solutions architect at Amazon Web Services.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Mark Samson is a principal systems engineer at Cloudera, helping customers solve their big data problems using enterprise data hubs based on Hadoop. Mark has 17 years’ experience working with big data and information management software in technical sales, service delivery, and support roles.

Presentations

Information architecture for an enterprise data cloud Session

It's now possible to build a modern data platform capable of storing, processing, and analyzing a wide variety of data across multiple public and private cloud platforms and on-premises data centers. Mark Samson and Phillip Radley outline an information architecture for such a platform, informed by working with multiple large organizations that have built such platforms over the last five years.

Danilo Sato is a principal consultant at ThoughtWorks with more than 17 years of experience in many areas of architecture and engineering: software, data, infrastructure, and machine learning. Balancing strategy with execution, Danilo helps clients refine their technology strategy while adopting practices to reduce the time between having an idea, implementing it, and running it in production using the cloud, DevOps, and continuous delivery. He is the author of DevOps in Practice: Reliable and Automated Software Delivery, is a member of ThoughtWorks’ Technology Advisory Board and Office of the CTO, and is an experienced international conference speaker.

Presentations

Continuous intelligence: Moving machine learning into production reliably Tutorial

Danilo Sato and Christoph Windheuser walk you through applying continuous delivery (CD), pioneered by ThoughtWorks, to data science and machine learning. Join in to learn how to make changes to your models while safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with a high frequency.

Volker Schnecke has almost 20 years’ experience of working in research and development in the pharmaceutical industry. His current role is in late-stage clinical development at Novo Nordisk in Denmark, where he focuses on exploiting observational data to support the obesity pipeline. His tasks cover the whole drug discovery and development value chain, from collaborating with preclinical researchers to producing evidence for marketing of new medicines.

Presentations

Using electronic health records to predict health risks associated with obesity DCS

Today, more than 650 million people worldwide are obese, and most of them will develop additional health issues during their lifetime. However, not all are at equal risk. Volker Schnecke discusses how Novo Nordisk mines the electronic health records (EHRs) of millions of patients to understand the risk in people with obesity and to support the discovery of new medicines.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Hands-on data science with Python 2-Day Training

Robert Schroll walks you through all the steps of developing a machine learning pipeline from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python.

Hands-on data science with Python (Day 2) Training Day 2

Robert Schroll walks you through all the steps of developing a machine learning pipeline from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python.

Max Schultze is a data engineer working on building a data lake at Zalando, Europe’s biggest online fashion retailer. His focus lies on building data pipelines at a scale of terabytes per day and productionizing Spark and Presto as analytical platforms inside the company. He graduated from the Humboldt University of Berlin, where he actively took part in the university’s initial development of Apache Flink.

Presentations

From legacy to cloud: An end-to-end data integration journey Session

Max Schultze details Zalando's end-to-end data integration platform to serve analytical use cases and machine learning throughout the company, covering raw data collection, standardized data preparation (binary conversion, partitioning, etc.), user-driven analytics, and machine learning.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses a number of challenges. Join Colm Moynihan, Jonathan Seidman, and Michael Kohs to explore cloud architecture and challenges and learn how to use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Deb Seys is determined to democratize access to data, working at the intersection of people, technology, and knowledge to make it happen, with a focus on how employees find, understand, and use data in their jobs. For over 20 years, she’s worked on applications, search engines, and websites that serve employees inside an enterprise, most recently at eBay. Her goal has always been to enable users to help themselves to find, consume, and collaborate with information and data. In other roles, she delivered search and taxonomy technology for the intranet at Kaiser Permanente and was a systems librarian at Hewlett-Packard Labs Research Library. Deb holds an MLIS from UC Berkeley.

Presentations

Data catalogs are changing the nature of working with data (sponsored by Alation) Session

Deb Seys shares the results of a study that she oversaw at eBay in collaboration with the Kellogg School of Management at Northwestern University. Examining the work of 2,000 analysts and almost 80,000 queries, the study revealed that a data catalog can be used as a learning platform that increases analyst productivity and creates a more collaborative approach to discovery and innovation.

Ben Sidhom is a software engineer on the Dataproc team at Google, improving the experience of autoscaling with Spark.

Presentations

Improving Spark downscaling; Or, Not throwing away all of our work Session

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers, covering everything from how to schedule jobs to the location of blocks and their impact.

Rosaria Silipo is a principal data scientist at KNIME. She loved data before it was big and learning before it was deep. She’s spent 25+ years in applied AI, predictive analytics, and machine learning at Siemens, Viseca, Nuance Communications, and private consulting. Rosaria shares her practical experience in a broad range of industries and deployments, including IoT, customer intelligence, financial services, and cybersecurity, and through her 50+ technical publications, including her recent ebook, Practicing Data Science: A Collection of Case Studies. Follow her on Twitter, LinkedIn, and the KNIME blog.

Presentations

Practicing data science: A collection of case studies Session

Rosaria Silipo shares a collection of past data science projects. While the structure is often similar (data collection, data transformation, model training, deployment), each required its own special trick, whether a change in perspective or a particular technique to deal with special cases and special business questions.

Alkis Simitsis is a chief scientist for cybersecurity analytics at Micro Focus. Alkis has more than 15 years of experience building innovative information and data management solutions in areas like real-time business intelligence, security, massively parallel processing, systems optimization, data warehousing, graph processing, and web services. He holds 26 US patents and has filed over 50 patent applications in the US and worldwide. He’s published more than 100 papers in refereed international journals and conferences (top publications cited 5,000+ times) and frequently serves in various roles in program committees of top-tier international scientific conferences. He’s also an IEEE senior member and a member of the ACM.

Presentations

A Magic 8 Ball for optimal cost and resource allocation for the big data stack Session

Cost and resource provisioning are critical components of the big data stack. Shivnath Babu and Alkis Simitsis detail how to build a Magic 8 Ball for the big data stack—a decomposable time series model for optimal cost and resource allocation that offers enterprises a glimpse into their future needs and enables effective and cost-efficient project and operational planning.

Rebecca Simmonds is a senior software engineer at Red Hat, where she’s part of an emerging technology group comprising both data scientists and developers. She has a keen interest in architecture design and data analysis, which she is furthering at Red Hat with OpenShift and ML research. Previously, she was a Java developer creating solutions to improve performance for a CV analyzer at a small startup. She holds a PhD from Newcastle University, where she developed a platform for scalable geospatial and temporal analysis of Twitter data.

Presentations

Application intelligence: Bridging the gap between human expertise and machine learning Session

Artificial intelligence and machine learning are now popularly used terms, but how do you make use of these techniques without throwing away the valuable knowledge of experienced employees? Rebecca Simmonds and Michael McCune delve into this idea with examples of how distributed machine learning frameworks fit together naturally with business rules management systems.

Pete Skomoroch is the former head of data products at Workday and LinkedIn. He’s a senior executive with extensive experience building and running teams that develop products powered by data and machine learning. Previously, he was cofounder and CEO of venture-backed deep learning startup SkipFlag (acquired by Workday in 2018) and a principal data scientist at LinkedIn, the world’s largest professional network, with over 500 million members worldwide. As an early member of the data team, he led data science teams focused on reputation, search, inferred identity, and building data products. He was also the creator of LinkedIn Skills and LinkedIn Endorsements, one of the fastest-growing new product features in LinkedIn’s history.

Presentations

Executive Briefing: Why managing machines is harder than you think Session

In the next decade, companies that understand how to apply machine intelligence will scale and win their markets. Others will fail to ship successful AI products that matter to customers. Pete Skomoroch details how to combine product design, machine learning, and executive strategy to create a business where every product interaction benefits from your investment in machine intelligence.

Guoqiong Song is a senior deep learning software engineer on the big data technology team at Intel. She’s interested in developing and optimizing distributed deep learning algorithms on Spark. She holds a PhD in atmospheric and oceanic sciences with a focus on numerical modeling and optimization from UCLA.

Presentations

LSTM-based time series anomaly detection using Analytics Zoo for Spark and BigDL Session

Collecting and processing massive time series data (e.g., logs and sensor readings) and detecting anomalies in real time is critical for many emerging smart systems, such as industrial, manufacturing, AIOps, and the IoT. Guoqiong Song explains how to detect anomalies in time series data using Analytics Zoo and BigDL at scale on a standard Spark cluster.

Raghotham Sripadraj is a senior data scientist at Ericsson. Raghotham is also a mentor for data science on Springboard. Previously, he headed the data science team at Treebo Hotels and was cofounder and data scientist at Unnati Data Labs, where he built end-to-end data science systems in the fields of fintech, marketing analytics, and event management. Before that, at Touchpoints Inc., he single-handedly built a data analytics platform for a fitness wearable company, and at SAP Labs, he was a core part of what is currently SAP’s framework for building web and mobile products, as well as a part of multiple company-wide events helping to spread knowledge both internally and to customers. Drawing on his deep love for data science and neural networks and his passion for teaching, Raghotham has conducted workshops across the world and given talks at a number of data science conferences. Apart from getting his hands dirty with data, he loves traveling, Pink Floyd, and masala dosas.

Presentations

Deep learning for fonts Session

Deep learning has enabled massive breakthroughs in offbeat tracks and has enabled better understanding of how an artist paints, how an artist composes music, and so on. Nischal Harohalli Padmanabha and Raghotham Sripadraj discuss their project Deep Learning for Humans and their plans to build a font classifier.

Scott Stevenson

Presentations

Deep learning for speech synthesis: The good news, the bad news, and the fake news Session

Modern deep learning systems allow us to build speech synthesis systems with the naturalness of a human speaker. While there are myriad benevolent applications, this also ushers in a new era of fake news. Scott Stevenson explores the danger of such systems and details how deep learning can also be used to build countermeasures to protect against political disinformation.

Václav Surovec is a senior big data engineer and comanages the Big Data Department at Deutsche Telekom IT. The department’s more than 45 engineers deliver big data projects to Germany, the Netherlands, and the Czech Republic. Recently, he led the commercial roaming project. Previously, he worked at T-Mobile Czech Republic while he was still a student of Czech Technical University in Prague.

Presentations

Data science at Deutsche Telekom: Predicting global travel patterns and network demand Session

Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management.

Anna Szonyi is an engineering manager at Cloudera, where she established and manages the data interoperability team. Anna cares about enabling people to build high-quality software in a sustainable environment. Previously, she was a software engineer at Cloudera working on Apache Sqoop and worked on risk management systems at Morgan Stanley.

Presentations

Picking Parquet: Improved performance for selective queries in Impala, Hive, and Spark Session

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. Anna Szonyi and Zoltán Borók-Nagy share the technical details of the design and its implementation along with practical tips to help data architects leverage these new capabilities in their schema design and performance results for common workloads.

Chris Taggart is the cofounder and CEO of OpenCorporates, the largest open database of companies in the world. OpenCorporates’s primary mission is to open up and connect company data from across the globe, making it more useful, usable, and understandable for the public benefit. OpenCorporates has already made a clear and significant impact. First, its database of over 160 million companies in 130 jurisdictions is a critical tool for investigative journalists, NGOs, academics, due diligence professionals, and government agencies from across the globe. Notable users include the ICIJ’s Panama and Paradise Papers investigations, the Organised Crime and Corruption Reporting Project, the BBC, the Financial Times, the Times, Global Witness, and Transparency International. OpenCorporates has also been the leading force behind the push to make company registers open data for access to all, with numerous successes. This public benefit mission is supported by an innovative virtuous-circle, public-benefit business model, whereby the free public access is subsidized by commercial users who pay for data in bulk, having confidence in its quality due to the intrinsic many-eyes feedback loop. Commercial users include Mastercard, Capital One, Factset, Transferwise, and PwC.

Presentations

The unstoppable rise of white box data Keynote

Chris Taggart explains the benefits of white box data and outlines the structural shifts that are moving the data world toward it.

The unstoppable rise of white box data Findata

Chris Taggart explains the benefits of white box data and outlines the structural shifts that are moving the data world toward it.

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

Alex Thomas and Claudiu Branzan lead a hands-on introduction to scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working code base that you can change and improve.

Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale Session

Alexander Thomas and Alexis Yelton demonstrate how to use Spark NLP and Apache Spark to standardize semistructured text, illustrated by Indeed's standardization process for résumé content.

Mike Tidmarsh is Ogilvy’s global chief technology officer, responsible for fostering an environment where some of the brightest and most innovative marketing and creative technologists can use their knowledge, experience, and passion to make a real impact on client results and brand reputation. Fascinated by the power and possibility that technology and data increasingly afford, Mike’s driven by the idea that technology does its best work when it’s steered to make a positive improvement to the human experience, and that now, more than ever before, the brands that get this blend of art and science worked out are the ones that will truly thrive. Mike advises key clients on digital transformation, martech strategy, and the realization of customer experience and engagement strategies through technology and data. He also sits on the partner advisory boards of a number of major top-tier martech experience and engagement platforms. Previously, he ran a number of key global high-tech accounts in the Asia Pacific region at Ogilvy and spent 11 years as a business, technology, and change leader with Coopers & Lybrand and Deloitte Consulting, based in London and Sydney respectively. He’s worked extensively across the US, Europe, and Asia for a roster of blue chip clients including IBM, Lenovo, SC Johnson, Unilever, AstraZeneca, Dyno Nobel, Hoechst, Brambles Holdings, Avnet, Goodyear Tire and Rubber, and Dimension Data.

Presentations

Rise of the (advertising) machines Keynote

Ogilvy's Mike Tidmarsh looks at how data and AI are radically reshaping the world of marketing communications and explores the impacts—good and bad—for professionals and consumers alike.

Deepak Tiwari is the head of product management for data at Lyft, where he’s responsible for the company’s data vision as well as for building its data infrastructure, data platform, and data products. This includes Lyft’s streaming infrastructure for real-time decision making, geodata store and visualization, platform for machine learning, and core infrastructure for big data analytics. Previously, he was a product management leader at Google, where he worked on search, cloud, and technical infrastructure products. Deepak is passionate about building products that are driven by data, focus on user experience, and work at web scale. He holds an MBA from Northwestern’s Kellogg School of Management and a BT in engineering from the Indian Institute of Technology, Kharagpur.

Presentations

The Lyft data platform: Now and in the future Session

Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.

Teresa Tung is a managing director at Accenture, where she’s responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s Applied Intelligence Platform. Teresa is Accenture’s most prolific inventor with 170+ patents and applications. She holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

An Innovation Architecture industrializes AI from PoCs to production Session

Innovation is abundant as companies reimagine themselves as data-driven and AI-powered businesses. How do enterprises organize to move beyond numerous, often similar proofs of concept (PoCs) into production-quality products and services? Teresa Tung and Jean-Luc Chatelain explore Accenture’s Innovation Architecture, which manages PoCs and pilots through embedding into scalable, saleable solutions.

Executive Briefing: Using a domain knowledge graph to manage AI at scale Session

How do enterprises scale beyond one-off AI projects to make AI reusable? Teresa Tung and Jean-Luc Chatelain explain how domain knowledge graphs, the technology behind today's internet search, can bring the same democratized experience to enterprise AI. They then explore other applications of knowledge graphs in oil and gas, financial services, and enterprise IT.

Sandeep Uttamchandani is the hands-on chief data architect and head of data platform engineering at Intuit, where he’s leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep held engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep’s experience uniquely combines building enterprise data products and operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s federal and Fortune 100 customers. Sandeep has received several excellence awards. He has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He blogs on LinkedIn and his personal blog, Wrong Data Fabric. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.

Presentations

Half-correct and half-wrong collective data wisdom: 3 patterns to sanity Session

Teams today rely on dictionaries of collective wisdom—a mixed bag with regard to correctness: some datasets have accurate attribute details, while others are incorrect and outdated. This significantly impacts productivity of analysts and scientists. Sandeep Uttamchandani outlines three patterns to better manage data dictionaries.

Sandra Wachter is a lawyer and research fellow (assistant professor) in data ethics, AI, robotics, and internet regulation/cybersecurity at the Oxford Internet Institute at the University of Oxford, where she also teaches internet technologies and regulation. Sandra is also a fellow at the Alan Turing Institute in London; a fellow of the World Economic Forum’s Global Futures Council on Values, Ethics, and Innovation; an academic affiliate at the Bonavero Institute of Human Rights at Oxford’s Law Faculty; and a member of the Law Committee of the IEEE. Sandra serves as a policy advisor for governments, companies, and NGOs around the world on regulatory and ethical questions concerning emerging technologies. Her work has been featured in the Telegraph, the Financial Times, the Sunday Times, the Economist, Science, the BBC, the Guardian, Le Monde, New Scientist, Die Zeit, Der Spiegel, Sueddeutsche Zeitung, Engadget, and Wired. In 2018, she won the O2RB Excellence in Impact Award and in 2017 the CognitionX AI superhero Award.

Sandra specializes in technology, IP, and data protection law as well as European, international, human rights, and medical law. She’s also interested in the legal and ethical aspects of robotics (e.g., surgical, domestic, and social robots) and autonomous systems (e.g., autonomous and connected cars), including liability, accountability, and privacy issues. Internet policy and regulation and cybersecurity issues are also at the heart of her research, where she addresses areas such as online surveillance and profiling, censorship, intellectual property law, and human rights and identity online. Previous work also looked at (bio)medical law and bioethics in areas such as interventions in the genome and genetic testing under the Convention on Human Rights and Biomedicine. Sandra studied at the University of Oxford and the University of Vienna and previously worked at the Royal Academy of Engineering and the Austrian Ministry of Health.

Presentations

Privacy, identity, and autonomy in the age of big data and AI Keynote

Big data analytics and AI draw nonintuitive and unverifiable inferences about the behaviors, preferences, and lives of individuals. These inferences draw on diverse and feature-rich data of unpredictable value and create new opportunities for discriminatory, biased, and invasive decision making. Sandra Wachter discusses how this expands potential victims of discrimination and potential harm.

Kai Wähner is a technology evangelist at Confluent. Kai’s areas of expertise include big data analytics, machine learning, deep learning, messaging, integration, microservices, the internet of things, stream processing, and blockchain. He’s a regular speaker at international conferences such as JavaOne, O’Reilly Software Architecture, and ApacheCon and has written a number of articles for professional journals. Kai also shares his experiences with new technologies on his blog.

Presentations

Unleashing Apache Kafka and TensorFlow in hybrid architectures Session

How do you leverage the flexibility and extreme scale of the public cloud and the Apache Kafka ecosystem to build scalable, mission-critical machine learning infrastructures that span multiple public clouds—or bridge your on-premises data center to the cloud? Join Kai Wähner to learn how to use technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures.

Chris Wallace is a data scientist at Cloudera Fast Forward Labs, where he works on making breakthroughs in machine intelligence accessible and applicable in the “real world.” He has previous experience doing data science in organizations both large (the UK NHS) and small (as the first employee at a tech startup). Chris likes building data products and cares deeply about making technology work for people, not vice versa. He holds a PhD in particle physics from the University of Durham.

Presentations

Federated learning: Machine learning with privacy on the edge Session

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. Chris Wallace discusses the algorithmic solutions and the product opportunities.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Tom Walwyn is an engineering manager for data at Cloudflare. He helps teams move data from the Cloudflare edge to its core at global scale. He enjoys environments where the data-to-people ratio is large. When he’s not wrangling data, he enjoys running through London and surrounds.

Presentations

Simplicity at scale: How Cloudflare analyses some of the world’s largest DDoS attacks Session

Cloudflare powers nearly 10 percent of all Internet requests worldwide, absorbing some of the largest DDoS attacks. Learn how we use ClickHouse and SQL to simplify our data pipelines on a global scale while experiencing over 10 million events per second.

Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He’s head of developer relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he’s the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He earned his PhD in physics from the University of Washington.

Presentations

Executive Briefing: What it takes to use machine learning in fast data pipelines Session

Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities in streaming data pipelines so you can leverage the results quickly and update them as needed. He also covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations.

Hands-on machine learning with Kafka-based streaming pipelines Tutorial

Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.

Moshe Wasserblat is the Natural Language Processing and Deep Learning Research Group manager for Intel’s Artificial Intelligence Products Group. Previously, he was with NICE Systems for more than 17 years, where he founded and led the speech and text analytics research team. His interests are in the field of speech processing and natural language processing. He was the cofounder and coordinator of the EXCITEMENT FP7 ICT program and served as organizer and manager of several initiatives, including many Israeli chief scientist programs. He has filed more than 60 patents in the field of language technology and has several publications in international conferences and journals. His areas of expertise include speech recognition, conversational natural language processing, emotion detection, speaker separation, speaker recognition, deep learning, and machine learning.

Presentations

NLP Architect by Intel's AI Lab Session

Moshe Wasserblat offers an overview of NLP Architect, an open source deep learning NLP library that provides state-of-the-art NLP models, making it easy for researchers to implement NLP algorithms and for data scientists to build NLP-based solutions for extracting insight from textual data to improve business operations.

Sophie Watson is a senior data scientist at Red Hat, where she helps customers use machine learning to solve business problems in the hybrid cloud. She’s a frequent public speaker on topics including machine learning workflows on Kubernetes, recommendation engines, and machine learning for search. Sophie earned her PhD in Bayesian statistics.

Presentations

Learning "learning to rank" Session

Identifying relevant documents quickly and efficiently enhances both user experience and business revenue every day. Sophie Watson demonstrates how to implement learning-to-rank algorithms and provides you with the information you need to implement your own successful ranking system.

Thomas Weise is a software engineer for the streaming platform at Lyft. He’s also a PMC member for the Apache Apex and Apache Beam projects and has contributed to several more projects within the ASF ecosystem. Thomas is a frequent speaker at international big data conferences and the author of Learning Apache Apex.

Presentations

Streaming at Lyft Session

Fast data and stream processing are essential for making Lyft rides a good experience for passengers and drivers. Lyft's systems need to track and react to event streams in real time to update locations, compute routes and estimates, balance prices, and more. Thomas Weise offers an overview of the streaming platform that powers these use cases.

Charlotte Werger is head of data science at Van Lanschot Kempen, where she’s challenged to transform the wealth manager and private bank from a traditional company into a cutting-edge data-driven one. Charlotte works at the intersection of artificial intelligence and finance. After completing her PhD at the European University Institute in Florence, she was a portfolio manager and quant researcher at BlackRock and Man AHL in London, where she was part of an early movement in asset management that initiated the application of machine learning models to predict financial markets. She then worked for ASI Data Science, helping its clients build AI applications and software. Charlotte is internationally active in the field of data science and AI education. She’s an instructor at Datacamp, mentors data science students on the Springboard platform, and holds an advisory role at Ryelore AI.

Presentations

Data science transformation: Transforming a traditional wealth manager to a cutting-edge data-driven company Findata

Charlotte Werger outlines the components necessary to transform a traditional wealth manager into a data-driven business, paying special attention to devising and executing a transformation strategy by identifying key business subunits where automation and improved predictive modeling can result in significant gains and synergies.

Elliot West is a principal engineer at Hotels.com in London, where he designs tooling and platforms in the big data space. Previously, Elliot worked on Last.fm’s data team, developing services for managing large volumes of music metadata.

Presentations

Herding elephants: Seamless data access in multicluster clouds Session

Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms.

Mutant tests too: The SQL Session

Elliot West and Jay Green share approaches for applying software engineering best practices to SQL-based data applications to improve maintainability and data quality. Using open source tools, Elliot and Jay show how to build effective test suites for Apache Hive code bases and offer an overview of Mutant Swarm, a tool to identify weaknesses in tests and to measure SQL code coverage.

Arif Wider is a lead consultant and developer at ThoughtWorks Germany, where he enjoys building scalable applications, teaches Scala, and consults at the intersection of data science and software engineering. Previously, he was a researcher with a focus on data synchronization, bidirectional transformations, and domain-specific languages.

Presentations

Continuous intelligence: Keeping your AI application in production Session

Machine learning can be challenging to deploy and maintain. Any delays in moving models from research to production mean leaving your data scientists' best work on the table. Arif Wider and Emily Gorcenski explore continuous delivery (CD) for AI/ML along with case studies for applying CD principles to data science workflows.

Alicia Williams is an advocate for Google Cloud. Previously, she spent six years as a program manager; through building, managing, and measuring programs and processes, she fell in love with data science. Known to hang out in spreadsheets surrounded by formulas, she also uses machine learning, SQL, and visualizations to help solve problems and tell stories.

Presentations

Building custom machine learning models for production, without ML expertise DCS

You don’t need to be an expert to bring ML to your business. Alicia Williams explains how two media companies used pretrained models and AutoML to organize content and make it accessible around the world. Along the way, she details the business problems they solved with ML, demonstrates the ease of use of the tools themselves, and shows the value that ML has brought in each case.

Christoph Windheuser is the global head of intelligent empowerment at ThoughtWorks, where he’s responsible for the company’s positioning on data management, machine learning, and artificial intelligence. Previously, he held a number of positions in the IT industry at companies like SAP and Capgemini. Christoph studied computer science in Bonn (Germany), Pittsburgh (USA), and Paris (France), and he holds a PhD in speech recognition with artificial neural networks.

Presentations

Continuous intelligence: Moving machine learning into production reliably Tutorial

Danilo Sato and Christoph Windheuser walk you through applying continuous delivery (CD), pioneered by ThoughtWorks, to data science and machine learning. Join in to learn how to make changes to your models while safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with a high frequency.

Mingxi Wu is the vice president of engineering at TigerGraph, a Silicon Valley-based startup building a world-leading real-time graph database. Over his career, Mingxi has focused on database research and data management software. Previously, he worked in Microsoft’s SQL Server Group, Oracle’s Relational Database Optimizer Group, and Turn Inc.’s Big Data Management Group. Lately, his interest has turned to building an easy-to-use and highly expressive graph query language. He’s won research awards from the most prestigious publication venues in database and data mining, including SIGMOD, KDD, and VLDB, and has authored five US patents with three more international patents pending. Mingxi holds a PhD specializing in both database and data mining from the University of Florida.

Presentations

8 prerequisites of a graph query language Session

A graph query language is the key to unleashing the value of connected data. Mingxi Wu outlines the eight prerequisites of a practical graph query language, drawn from six years' experience dealing with real-world graph analytical use cases. Along the way, Mingxi compares GSQL, Gremlin, Cypher, and SPARQL, pointing out their respective pros and cons.

Jerry Xu is cofounder and CTO at Datatron Technologies. An innovative software engineer with extensive programming and design experience in storage systems, online services, mobile, distributed systems, virtualization, and OS kernels, Jerry also has a demonstrated ability to direct and motivate a team of software engineers to complete projects meeting specifications and deadlines. Previously, he worked at Zynga, Twitter, Box, and Lyft, where he built the company’s ETA machine learning model. Jerry is the author of open source project LibCrunch. He’s a three-time Microsoft Gold Star Award winner.

Presentations

Model governance and model ops in the enterprise Session

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them.

Chendi Xue is a software engineer on the data analytics team at Intel. She has more than five years’ experience in big data and cloud system optimization, focusing on storage, network software stack performance analysis, and optimization. Her development work includes Spark-Shuffle optimization, Spark-SQL ColumnarBased execution, compute-side cache implementation, and storage benchmark tool implementation. Previously, she worked on Linux device mapper optimization and iSCSI optimization during her master’s degree studies.

Presentations

Big data analytics in the public cloud: Challenges and opportunities Session

Jian Zhang, Chendi Xue, and Yuan Zhou explore the challenges of migrating big data analytics workloads to the public cloud (e.g., performance loss and missing features) and demonstrate how to use a new in-memory data accelerator leveraging persistent memory and RDMA NICs to resolve these issues and enable new opportunities for big data workloads on the cloud.

Itai Yaffe is a big data tech lead at Nielsen Identity Engine, where he deals with big data challenges using tools like Spark, Druid, Kafka, and others. He’s also a part of the Israeli chapter’s core team of Women in Big Data. Itai is keen on sharing his knowledge and has presented his real-life experience in various forums.

Presentations

Stream, stream, stream: Different streaming methods with Spark and Kafka Session

NMC (Nielsen Marketing Cloud) provides customers (both marketers and publishers) with real-time analytics tools to profile their target audiences. To achieve that, the company needs to ingest billions of events per day into its big data stores in a scalable, cost-efficient way. Itai Yaffe explains how NMC continuously transforms its data infrastructure to support these goals.

Alexis Yelton is a data scientist at Indeed focusing on building machine learning models for software products. She’s been working with Spark since version 1.6 and has recently moved into the NLP space. She holds a PhD in bioinformatics and did postdoctoral work building models to predict gene function and explain ecosystem function.

Presentations

Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale Session

Alexander Thomas and Alexis Yelton demonstrate how to use Spark NLP and Apache Spark to standardize semistructured text, illustrated by Indeed's standardization process for résumé content.

Jian Zhang is a senior software engineering manager at Intel, where he and his team primarily focus on open source storage development and optimizations on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph and working with Hadoop Distributed File System (HDFS) and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Presentations

Big data analytics in the public cloud: Challenges and opportunities Session

Jian Zhang, Chendi Xue, and Yuan Zhou explore the challenges of migrating big data analytics workloads to the public cloud (e.g., performance loss and missing features) and demonstrate how to use a new in-memory data accelerator leveraging persistent memory and RDMA NICs to resolve these issues and enable new opportunities for big data workloads on the cloud.

Weifeng Zhong is a senior research fellow at the Mercatus Center at George Mason University. His work focuses on bridging natural language processing and machine learning with economic policy studies. His other research interests include the political economy, US-China economic relations, and China’s economic issues. Weifeng is a core maintainer of the open source Policy Change Index (PCI) project, a framework that uses machine learning to “read” large volumes of text and detect subtle, structural changes embedded in it. As a first use case, the PCI for China is an algorithm that can predict China’s policy changes using the information in the government’s official newspaper. The PCI framework has received significant academic interest and media coverage. The resources of this project are freely available at Policychangeindex.org. Weifeng has been published in a variety of scholarly journals, including the Journal of Institutional and Theoretical Economics. His research and writings have been featured in the Financial Times, Foreign Affairs, The National Interest, Real Clear Markets, Real Clear Politics, the South China Morning Post, and the Wall Street Journal, among others.

Presentations

Reading China: Predicting policy change with machine learning Session

Weifeng Zhong shares a machine learning algorithm built to “read” the People’s Daily (the official newspaper of the Communist Party of China) and predict changes in China’s policy priorities. The output of this algorithm, named the Policy Change Index (PCI) of China, turns out to be a leading indicator of the actual policy changes in China since 1951.

Yuan Zhou is a senior software development engineer in the Software and Service Group at Intel, where he works on the Open Source Technology Center team primarily focused on big data storage software. He’s been working in databases, virtualization, and cloud computing for most of his 7+ year career at Intel.

Presentations

Big data analytics in the public cloud: Challenges and opportunities Session

Jian Zhang, Chendi Xue, and Yuan Zhou explore the challenges of migrating big data analytics workloads to the public cloud (e.g., performance loss and missing features) and demonstrate how to use a new in-memory data accelerator leveraging persistent memory and RDMA NICs to resolve these issues and enable new opportunities for big data workloads on the cloud.

Xiaoyong Zhu is a senior data scientist at Microsoft, where he focuses on distributed machine learning and its applications.

Presentations

Inclusive design: Deep learning on audio in Azure, identifying sounds in real time Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in the world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.