Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Speakers

Hear from innovative CxOs, talented data practitioners, and senior engineers who are leading the data industry. More speakers will be announced; please check back for updates.

Filter

Search Speakers

Peter Aiken is an acknowledged Data Management (DM) authority. As a practicing data consultant, professor, author and researcher, he has studied DM for more than 30 years. International recognition has come from assisting more than 150 organizations in 30 countries including some of the world’s most important. He is a dynamic presence at events and author of 10 books and multiple publications, including his latest on Data Strategy. Peter also hosts the longest running and most successful webinar series dedicated to DM (hosted by dataversity.net). In 1999, he founded Data Blueprint Inc, a consulting firm that helps organizations leverage data for profit, improvement, competitive advantage and operational efficiencies. He is also Associate Professor of Information Systems at Virginia Commonwealth University (VCU), past President of the International Data Management Association (DAMA-I) and Associate Director of the MIT International Society of Chief Data Officers.

Presentations

Your Data Strategy: It Should Be Concise, Actionable, and Understandable by Business and IT! Tutorial

This tutorial presents a more operational perspective on the use of data strategy that is especially useful for organizations just getting started with data.

Alasdair Allan is a scientist and researcher who has authored over eighty peer-reviewed papers and eight books, and has been involved with several standards bodies. Originally an astrophysicist, he now works as a consultant and journalist, focusing on open hardware, machine learning, big data, and emerging technologies — with expertise in electronics, especially wireless devices and distributed sensor networks, mobile computing, and the “Internet of Things.” He runs a small consulting company, and has written for Make: Magazine, Motherboard/VICE, Hackaday, Hackster.io, and the O’Reilly Radar. In the past he has mesh networked the Moscone Center, caused a U.S. Senate hearing, and contributed to the detection of what was—at the time—the most distant object yet discovered.

Presentations

Executive Briefing: The Intelligent Edge and the Demise of Big Data? Session

The arrival of a new generation of smart embedded hardware may cause the demise of large-scale data harvesting. In its place, smart devices will allow us to process data at the edge, extracting insights without storing data that potentially infringes privacy and the GDPR. The current age, in which privacy is no longer "a social norm," may not long survive the coming of the Internet of Things.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Professional Kafka Development 2-Day Training

This training takes participants through an in-depth look at Apache Kafka. We show how Kafka works, how to create real-time systems with it, and how to create consumers and publishers. We then look at the components of Kafka’s ecosystem and how each is used, showing how to use Kafka Streams, Kafka Connect, and KSQL.

Professional Kafka Development (Day 2) Training Day 2

This training takes participants through an in-depth look at Apache Kafka. We show how Kafka works, how to create real-time systems with it, and how to create consumers and publishers. We then look at the components of Kafka’s ecosystem and how each is used, showing how to use Kafka Streams, Kafka Connect, and KSQL.

Eitan Anzenberg is the Chief Data Scientist at Flowcast AI, a seed-stage fintech startup in San Francisco. He leads the data science efforts, including machine learning explanations, interpretability, and what-if scenario analysis. His background is in machine learning, statistical learning, and programming. Eitan obtained his PhD in Physics from Boston University in 2012 and completed his postdoc at Lawrence Berkeley National Lab in 2014.

Presentations

Explainable Machine Learning in Fintech Session

Machine learning applications balance interpretability and performance. Linear models provide formulas to directly compare the influence of the input variables, while non-linear algorithms produce more accurate models. We utilize "what-if" scenarios to calculate the marginal influence of features per prediction and compare with standardized methods such as LIME.
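As an illustration of the idea (a minimal sketch with a hypothetical toy model, not the speaker's code), the marginal influence of a feature on a single prediction can be estimated by re-scoring the prediction with that feature swapped to a baseline value:

```python
def marginal_influence(predict, x, baselines):
    """Estimate per-feature influence for one prediction by replacing
    each feature with a baseline value (a "what-if" scenario)."""
    base_score = predict(x)
    influences = {}
    for name, baseline in baselines.items():
        x_whatif = dict(x)          # copy the input
        x_whatif[name] = baseline   # swap in the baseline value
        influences[name] = base_score - predict(x_whatif)
    return influences

# Toy linear scorer with made-up feature names for illustration only.
weights = {"utilization": 2.0, "late_payments": -3.0}
predict = lambda x: sum(weights[k] * v for k, v in x.items())

x = {"utilization": 0.8, "late_payments": 2.0}
baselines = {"utilization": 0.5, "late_payments": 0.0}
print(marginal_influence(predict, x, baselines))
```

For a linear model this recovers exactly weight * (value - baseline), which makes it a useful sanity check before applying the same what-if probing to non-linear models and comparing against methods such as LIME.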

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

A Magic 8-Ball for Optimal Cost and Resource Allocation for the Big Data Stack Session

Cost and resource provisioning are critical components of the big data stack. A magic 8-ball for the big data stack would give an enterprise a glimpse into its future needs and would enable effective and cost-efficient project and operational planning. This talk covers how to build that magic 8-ball, a decomposable time-series model, for optimal cost and resource allocation for the big data stack.
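As a rough sketch of what a decomposable time-series model looks like (a toy implementation under assumed inputs, not Unravel's system), demand can be split into a linear trend plus a repeating seasonal component, and both extrapolated forward:

```python
def decompose_and_forecast(series, period, horizon):
    """Fit a minimal decomposable model (linear trend + periodic
    seasonality) and extrapolate it `horizon` steps ahead."""
    n = len(series)
    # Linear trend via least squares on t = 0..n-1
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    trend = [intercept + slope * t for t in range(n)]
    # Seasonal component: mean detrended value per position in the period
    seasonal_sum = [0.0] * period
    counts = [0] * period
    for t in range(n):
        seasonal_sum[t % period] += series[t] - trend[t]
        counts[t % period] += 1
    seasonal = [s / c for s, c in zip(seasonal_sum, counts)]
    # Forecast = extrapolated trend + repeating seasonality
    return [intercept + slope * t + seasonal[t % period]
            for t in range(n, n + horizon)]

# Synthetic daily resource demand: upward trend with a weekly cycle
history = [100 + 2 * t + (30 if t % 7 in (0, 1) else 0) for t in range(28)]
forecast = decompose_and_forecast(history, period=7, horizon=7)
```

The forecast gives the "glimpse into future needs" that cost and capacity planning can then be driven from; production models would add holiday effects, changepoints, and uncertainty intervals.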

Jason Bell is a machine learning engineer specializing in high-volume streaming systems, big data solutions, and machine learning applications. Jason was section editor for Java Developer’s Journal, has contributed to IBM developerWorks on autonomic computing, and is the author of Machine Learning: Hands On for Developers and Technical Professionals.

Presentations

Learning how to perform ETL data migrations with the open source tool Embulk Session

The Embulk data migration tool offers a convenient way to load data into a variety of systems with basic configuration. This talk gives an overview of the Embulk tool and shows some common data migration scenarios that a data engineer could employ using the tool.

Francine Bennett is a data scientist and the CEO and cofounder of Mastodon C, a group of Agile big data specialists who offer the open source Hadoop-powered technology and the technical and analytical skills to help organizations to realize the potential of their data. Before founding Mastodon C, Francine spent a number of years working on big data analysis for search engines, helping them to turn lots of data into even more money. She enjoys good coffee, running, sleeping as much as possible, and exploring large datasets.

Presentations

Using data for evil V: the AI strikes back Session

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis (and possibly AI). Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

Daniel works in the Developer Relations team at Google. With more than fifteen years of experience in the software industry, Daniel has held positions at companies such as Ericsson and Opera Software. Daniel holds a Bachelor’s degree in Computer Science from Uppsala University. He lives in Stockholm and likes to spend his spare time freediving.

Presentations

Processing 10M samples/second to drive smart maintenance in complex IIoT systems Session

Learn how Cognite is developing IIoT smart maintenance systems that can process 10M samples/second from thousands of sensors. We’ll review an architecture designed for high performance, robust streaming sensor data ingest and cost-effective storage of large volumes of time series data, best practices for aggregation and fast queries, and achieving high-performance with machine learning.

I am currently working on query optimizations and resource utilization in Apache Spark at Qubole.

Presentations

Scalability-aware autoscaling of Spark applications Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. This talk covers (1) measuring the efficiency of autoscaling policies and (2) designing more efficient autoscaling policies, in terms of latency and cost.
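A minimal sketch of the scalability-aware idea (a hypothetical policy, not Qubole's implementation): use runtimes from historical runs to estimate scaling efficiency at each executor count, and stop scaling up once efficiency drops below a threshold:

```python
def choose_executors(runtime_by_count, max_executors, min_efficiency=0.7):
    """Pick the largest executor count whose scaling efficiency
    (observed speedup / ideal speedup) stays above a threshold.
    `runtime_by_count` must include a single-executor baseline (key 1)."""
    base = runtime_by_count[1]
    best = 1
    for n in sorted(runtime_by_count):
        if n > max_executors:
            break
        speedup = base / runtime_by_count[n]
        efficiency = speedup / n   # 1.0 means perfectly linear scaling
        if efficiency >= min_efficiency:
            best = n
    return best

# Historical runtimes (seconds) for one recurring Spark app,
# keyed by executor count (illustrative numbers)
history = {1: 1000, 2: 520, 4: 280, 8: 190, 16: 170}
print(choose_executors(history, max_executors=16))  # → 4
```

Here 8 executors still lower latency, but at 66% efficiency the extra cost outweighs the gain, so the policy settles on 4; a naive latency-only policy would have kept scaling to 16.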

Pradeep is a Big Data Engineer at Hotels.com in London where he builds and manages cloud infrastructure and core services like Apiary. Pradeep has worked in the big data space for the last 7 years, building large scale platforms.

Presentations

Herding Elephants: Seamless data access in a multi-cluster cloud Session

Expedia Group is a travel platform with an extensive portfolio including Expedia.com and Hotels.com. We like to give our data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. We'll explain how we built a unified virtual data lake on top of our many heterogeneous and distributed data platforms.

Presentations

Time Series Forecasting with Azure Machine Learning service Tutorial

Time series modeling and forecasting has fundamental importance to various practical domains and, during the past few decades, machine learning model-based forecasting has become very popular in the private and the public decision-making process. In this tutorial, we will walk you through the core steps for using Azure Machine Learning to build and deploy your time series forecasting models.

Wojciech Biela is a co-founder of Starburst and is responsible for product development. He has a background of over 14 years of building products and running engineering teams.

Previously Wojciech was the Engineering Manager at the Teradata Center for Hadoop, running the Presto engineering operations in Warsaw, Poland. Prior to that, back in 2011, he built and ran the Polish engineering team, a subsidiary of Hadapt Inc., a pioneer in the SQL-on-Hadoop space. Hadapt was acquired by Teradata in 2014. Earlier, Wojciech built and led teams on multi-year projects, from custom big e-commerce and SCM platforms to PoS systems.

Wojciech holds an M.S. in Computer Science from the Wroclaw University of Technology.

Presentations

Presto: Cost-Based Optimizer for interactive SQL-on-Anything Session

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3/Azure ADLS, RDBMS, NoSQL, etc.). Recently Starburst contributed a cost-based optimizer to Presto, which brings a great performance boost. Learn about the CBO’s internals, the motivating use cases, and the observed improvements.

Alun Biffin received his PhD in condensed matter physics at the University of Oxford before being awarded a Marie Curie Fellowship and going on to the Paul Scherrer Institute, Switzerland, to continue his research. He designed and conducted ground-breaking experiments on quantum magnets at cutting-edge facilities in Europe, the US, and Japan, and presented his work at international workshops and conferences. During this time he published three papers as first author and has been cited over 100 times. He was chosen for the highly selective ASI Data Science Fellowship in London in the summer of 2018. Since then he has been applying his passion for machine learning to real-life business problems, ranging from analyzing millions of webhits for online retailer Tails.com, to predicting customer behavior at one of the Netherlands’ largest private banks, and his current employer, Van Lanschot Kempen.

Presentations

Using Machine Learning for Stock Picking Session

In this talk we describe how machine learning revolutionized the stock-picking process for portfolio managers at Kempen Capital Management by filtering the vast small-cap investment universe down to a handful of optimal stocks.

Peter is a principal director at Accenture Belux specialized in data-driven architectures and solutions. With 15 years of experience, he works mainly with Financial Services clients, helping them adapt to the growing importance of data in today’s digital context. With a passion for innovation, Peter leads the assets and offerings around data-driven architectures. He applies these at clients, increasing their ability to automate more and more decisions and interactions with clients, prospects, suppliers, and employees. Peter is a strong advocate for the power of metadata and believes that if our way of dealing with it changes, companies can drive automation to a new level. This will allow them to combine both delivery and solution automation from the design phase, resulting in many efficiency and effectiveness benefits.

Presentations

Leveraging metadata for automating delivery and operations of advanced data platforms Session

In this session we will explain how to use metadata to automate delivery and operations of a data platform. By injecting automation into the delivery processes we shorten the time to market while improving the quality of the initial user experience. Typical examples include data profiling and prototyping, test automation, continuous delivery and deployment, and automated code creation.

David is passionate about helping businesses to build analytics-driven decision making to help them make quicker, smarter and bolder decisions. He leads customer strategy and insights at Harrods, the biggest and most iconic department store in Europe. He has previously built global analytics and insight capabilities for a number of leading global entertainment businesses covering television (the BBC), book publishing (HarperCollins Publishers) and the music industry (EMI Music), helping to drive each organization’s decision making at all levels. He builds on experiences working to build analytics for global retailers as well as political campaigns in the US and UK, in philanthropy and in strategy consulting.

Presentations

Keynote with David Boyle Keynote

David Boyle, Customer Insights Director, Harrods

Claudiu Branzan is the vice president of data science and engineering at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial for scalable NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

James Burke has been called “One of the most intriguing minds in the Western world” (Washington Post). His audience is global. His influence in the field of the public understanding of science and technology is acknowledged in citations by such authoritative sources as the Smithsonian and Microsoft CEO Bill Gates. His work is on the curriculum of universities and schools across the United States.

In 1965 James Burke began work with BBC-TV on Tomorrow’s World and went on to become the BBC’s chief reporter on the Apollo Moon missions. For over forty years he has produced, directed, written and presented award-winning television series on the BBC, PBS, Discovery Channel and The Learning Channel. These include historical series, such as Connections (aired in 1979, it achieved the highest-ever documentary audience); The Day the Universe Changed; Connections2 and Connections3; a one-man science series, The Burke Special; a mini-series on the brain, The Neuron Suite; a series on the greenhouse effect, After the Warming; and a special for the National Art Gallery on Renaissance painting, Masters of Illusion.

A bestselling author, his publications include: Tomorrow’s World, Tomorrow’s World II, Connections, The Day the Universe Changed, Chances, The Axemaker’s Gift (with Robert Ornstein), The Pinball Effect, The Knowledge Web, Circles, and American Connections. He has also written a series of introductions for the book Inventing Modern America (MIT, 2002) and was a contributing author to Talking Back to the Machine (Copernicus, 1999) and Leading for Innovation (Drucker Foundation, 2002).

His book, Twin Tracks: The Unexpected Origins of the Modern World, focuses on the surprising connections among the seemingly unconnected people, events and discoveries that have shaped our world. Burke has also written and hosted a bestselling CD-ROM, Connections: A Mind Game, and provided consulting and scripting for Disney Epcot.

Burke is a frequent keynote speaker on the subject of technology and social change to audiences such as NASA, MIT, IBM, Microsoft, US Government Agencies and the World Affairs Council. He has also advised the National Academy of Engineering, The Lucas Educational Foundation and the SETI project.

He was a regular columnist for six years at Scientific American, and, most recently, contributed an essay on invention to the Britannica Online Encyclopedia. Burke is currently a contributor to TIME magazine. His most recent television work is a PBS retrospective of his work, ReConnections.

Educated at Oxford and holding honorary doctorates for his work in communicating science and technology, his latest project is an online interactive knowledge-mapping system (the ‘knowledge web’: www.k-web.org) to be used as a teaching aid, a management tool and a predictor. It is due to be online in 2020.

His next book, The Culture of Scarcity, will be published in 2020.

Presentations

Keynote with James Burke Keynote

Historian, Futurist, Author

Julia is an AI evangelist for Scout24 and is actively driving the culture change within Scout24. Julia has a strong background in product development, including data products, strategy, and innovation. She is an initiator of forward thinking and energizes through her creativity and enthusiasm.

Presentations

From data to data-driven to an AI-ready company - the culture change makes the difference DCS

Creating value from your data is not about technology or engineers. It is all about changing the culture in the company to make everyone aware of data and how to build on top of it. At Scout24 we are running a successful culture change, and 60% of employees already use our central BI tool. Since 2018 the focus has been on AI enablement.

Dr Paris Buttfield-Addison is co-founder of Secret Lab, a game development studio based in beautiful Hobart, Australia. Secret Lab builds games and game development tools, including the multi-award-winning ABC Play School iPad games, the BAFTA- and IGF-winning Night in the Woods, the Qantas airlines Joey Playbox games, and the open source Yarn Spinner narrative game framework. Previously, Paris was mobile product manager for Meebo (acquired by Google). Paris particularly enjoys game design, statistics, the blockchain, machine learning, and human-centered technology research and writes technical books on mobile and game development (more than 20 so far) for O’Reilly Media. He holds a degree in medieval history and a PhD in computing. Find him online at http://paris.id.au and @parisba

Presentations

Science-Fictional User Interfaces Session

Science-fiction has been showcasing complex, AI-driven (often AR or VR) interfaces (for huge amounts of data!) for decades. As television, movies, and video games became more capable of visualising a possible future, the grandeur of these imagined science fictional interfaces has increased. What can we learn from Hollywood UX? Is there a useful takeaway? Does sci-fi show the future of AI UX?

Tatiane Canero is Patient Flow manager at Hospital Israelita Albert Einstein in São Paulo, Brazil, and has been in charge for the last 8 years of orchestrating all clinical and support services areas to care for over 85,000 patients yearly.
She has been responsible for implementing several process and clinical improvement initiatives aimed at releasing hospital capacity and maximizing patient safety, experience, and level of care. These initiatives have released an additional 60 virtual beds of capacity yearly.
A digital tech enthusiast, she has been engaged in exploring how AI can disrupt patient flow. The first relevant outcome was the IRIS platform, which she intends to extend to public hospitals managed by the hospital.

Presentations

Insightful Health - Amplifying Intelligence in Healthcare Patient Flow Execution Session

How Albert Einstein and Accenture evolved patient flow experience and efficiency with the use of applied AI, statistics, and combinatorial math, allowing the hospital to anticipate E2E visibility within patient flow operations, from admission of emergency and elective demands, to assignment and medical releases.

Data science expert and software system architect with expertise in machine learning and big data systems. Rich experience leading innovation projects and R&D activities to promote data science best practice within large organizations. Deep domain knowledge of various vertical use cases (finance, telco, healthcare, etc.). Currently working on pushing cutting-edge applications of AI at the intersection of high-performance databases and IoT, focusing on unleashing the value of spatial-temporal data. Also a frequent speaker at various technology conferences, including the O’Reilly Strata AI Conference, NVIDIA GPU Technology Conference, Hadoop Summit, DataWorks Summit, Amazon re:Invent, Global Big Data Conference, Global AI Conference, World IoT Expo, and Intel Partner Summit, presenting keynote talks and sharing thoughts on technology leadership.

Received a Ph.D. from the Department of Computer and Information Science (CIS), University of Pennsylvania, under the advisory of Professor Insup Lee (ACM Fellow, IEEE Fellow). Published and presented research papers and posters at many top-tier conferences and journals, including ACM Computing Surveys, ACSAC, CEAS, EuroSec, FGCS, HiCoNS, HSCC, IEEE Systems Journal, MASHUPS, PST, SSS, TRUST, and WiVeC. Served as a reviewer for many highly reputable international journals and conferences.

Presentations

Building The Data Infrastructure For The Internet Of Things At Zettabyte-Scale Session

This session shares the architecture design and detailed technology innovations of Alibaba TSDB, a state-of-the-art database for IoT data management, drawn from years of development and continuous improvement.

Leading Business Intelligence Product Management at Uber.

Previously, a founding team and Senior Product Manager at ThoughtSpot. Helped build ThoughtSpot from 10 to 300+ people in 5 years. Created the world’s first analytics search engine at ThoughtSpot.

Education: University of California Berkeley, University of Illinois Urbana-Champaign, IIT Guwahati

Presentations

Integrated Business Intelligence Suite at Uber: How we built a platform to convert raw data into knowledge (insights) Session

Our experience building the Business Intelligence platform has been nothing short of extraordinary. This talk details how Uber thought about building its Business Intelligence platform; I’ll narrate the journey of how we took a platform approach rather than adding features in a piecemeal fashion.

Zhiling is a ML engineer at GO-JEK, one of the fastest growing startups in Asia. She and her colleagues work on scaling machine learning and driving impact throughout the organization. Her focus is on improving the speed at which data scientists iterate, the accuracy and performance of their models, the scalability of the systems they build, and the impact they deliver.

Presentations

Unlocking insights in AI by building a feature platform Session

Features are key to driving impact with AI at all scales. By democratizing the creation, discovery, and access of features through a unified platform, organizations are able to dramatically accelerate innovation and time to market. Find out how GO-JEK, Indonesia's first billion-dollar startup, built a feature platform to unlock insights in AI, and the lessons they learned along the way.

Felix Cheung is an engineer at Uber and a PMC and committer for Apache Spark. Felix started his journey in the big data space about five years ago with the then state-of-the-art MapReduce. Since then, he’s (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop distro from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark and ended up contributing to the project. In addition to building stuff, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Presentations

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber Session

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.

Divya Choudhary is a data scientist, currently working with a Jakarta based technology startup named GO-JEK. She is responsible for building algorithms and mathematical models to drive features across diversified products at GO-JEK.

With 4 years of work experience, Divya is a computer science engineer who has traversed her professional career from analyst to decision scientist to data scientist. The crux of any data science solution lies in having a problem-solving mindset, and Divya has been known for her business acumen and problem-solving approach across all the start-ups she has been a part of.

Personal:

  • a yoga lover
  • a poetess
  • a painter
  • an avid trekker & wanderer who is best at talking to people and learning about them

Presentations

From random text in addresses to world class feature of precise locations using NLP Session

Data scientists around the globe would agree that addresses are the most unorganised textual data. Structuring addresses has almost led to a new stream of NLP in itself. Who would've imagined that address text could be used to develop one of the coolest product features: finding the most precise pick-up/drop-off locations for e-commerce, logistics, food delivery, or ride/car service companies!

Ira Cohen is a cofounder and chief data scientist at Anodot, where he is responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Sequence-2-Sequence Modeling for Time Series Session

Recently, sequence-to-sequence (S2S) modeling has also been used for applications based on time series data. In this talk, we first give an overview of S2S and its early use cases. We then walk through how S2S modeling can be leveraged for time series use cases such as real-time anomaly detection and forecasting.

Robert Cohen is a senior fellow at the Economic Strategy Institute, where he is directing a new study to examine the economic and business impacts of machine learning and AI on firms and the U.S. economy.

Presentations

Data-driven digital transformation and Jobs: the New Software Hierarchy and ML Session

This talk describes the skills that employers are seeking from employees in digital jobs – linked to the new software hierarchy driving digital transformation. We describe this software hierarchy as one that ranges from DevOps, CI/CD, and microservices to Kubernetes and Istio. This hierarchy is used to define the jobs that are central to data-driven digital transformation.

Alex Combessie is a Data Scientist at Dataiku who designs and deploys machine learning projects from prototype to production. Prior to his time at Dataiku, he helped build the Data Science team at Capgemini Consulting in France. Having begun his career in economic analysis, he continues to work on interpretable models as a complement to deep learning. Alex is also a travel junkie who enjoys learning new things and making useful products.

Presentations

Improving Infrastructure Efficiency with Unsupervised Algorithms Session

GRDF helps bring natural gas to nearly 11 million customers every day. In partnership with GRDF, Dataiku worked to optimise the manual process of qualifying addresses to visit and ultimately save GRDF time and money. This solution was the culmination of a year-long adventure in the land of maintenance experts, legacy IT systems, and agile development.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow 2-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow (Day 2) Training Day 2

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Lidia Crespo leads the Big Data Governance activities in the CDO team. She and her team have been instrumental to the adoption of the technology platform, creating a sense of trust through their deep knowledge of the organisation's data. With her experience in complex and challenging international projects and an audit, IT, and data background, Lidia brings a combination that is difficult to find.

Presentations

The vindication of big data: How Hadoop is used at Santander UK to defend privacy Session

Big data is usually regarded as a menace to data privacy. However, with the right principles and mindset, it can be a game changer that puts customers first and treats data privacy as an inalienable right. Santander UK applied this model to comply with GDPR, using graph technology, Hadoop, Spark, and Kudu to drive data obscuring, data portability, and machine learning exploration.
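As a hypothetical illustration of data obscuring (a sketch of the general technique, not Santander's pipeline), a keyed hash can replace customer identifiers while keeping records joinable across systems:

```python
# Deterministic pseudonymization via a keyed hash (HMAC): the same input
# always maps to the same token, so datasets stay joinable, but the raw
# identifier is never stored. The key and identifiers are hypothetical.
import hashlib
import hmac

SECRET_KEY = b"store-me-in-a-vault-and-rotate"

def pseudonymize(customer_id: str) -> str:
    digest = hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened token for readability

a = pseudonymize("cust-42")
b = pseudonymize("cust-42")
c = pseudonymize("cust-43")
print(a == b, a == c)  # True False
```

Because the key is secret, tokens cannot be reversed to identifiers, yet the same customer links up across datasets.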

Samuel Cristobal holds an MSc in Advanced Mathematics and Applications (Universidad Autónoma de Madrid), a BSc (with honors) in Mathematics (Universidad Complutense de Madrid), and a BEng (valedictorian) in Telecommunication Systems (Universidad Politécnica de Madrid), and was a research associate fellow at the University of Vienna, working on mathematical research with a focus on algebraic geometry, logic, and computer science.

Samuel has been a researcher at Innaxis for ten years, during which he successfully executed more than a dozen data science projects in the field of aviation, ranging from mobility to safety, mostly as the technical or scientific coordinator. Currently, Samuel is the Science and Technology Director at Innaxis, managing the research agenda of the institute.

Presentations

Machine Learning in aviation is finally taking off DCS

DataBeacon is a multi-sided data and machine learning platform for the aviation industry. Two applications will be presented: SmartRunway, a machine learning solution for runway optimisation, and SafeOperations, predictive analytics for operations safety.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Data Case Studies Welcome Tutorial

Welcome to the Data Case Studies tutorial.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

Apurva joined Google over three years ago. He leads the Dataproc, Composer, and CDAP products on the Data Analytics team. Prior to Google, Apurva was at Lenovo/Motorola, leading their Mobile Cloud team for a year. Before that, he spent 3.5 years at Pivotal Software, where he built and commercialized Pivotal's Hadoop distribution, and six years at Yahoo, leading various search and display advertising efforts as well as the Hadoop solutions team. He holds a master's degree in EE from Simon Fraser University, B.C., Canada.

Presentations

Migrating Apache Oozie Workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, with the former focused on Apache Hadoop jobs. We see a need for Oozie-to-Airflow workflow mapping as part of creating an effective cross-cloud/cross-system solution. This talk introduces an open source Oozie-to-Airflow migration tool developed at Google.

Wolff Dobson is a developer programs engineer at Google specializing in machine learning and games. Previously, he worked as a game developer, where his projects included writing AI for the NBA 2K series and helping design the Wii Motion Plus. Wolff holds a PhD in artificial intelligence from Northwestern University.

Presentations

TensorFlow For Everyone Session

In this talk, we will cover the latest in TensorFlow, both for beginners and for developers migrating from 1.x to 2.0. We'll cover the best ways to set up your model, feed your data to it, and distribute it for fast training. We'll also look at how TensorFlow has been recently upgraded to be more intuitive.

David Dogon was born in Cape Town, South Africa, the same city where he completed a bachelor's degree in chemical engineering. A bit of an adventurer, he moved to New York to complete a master's in the same field at Columbia University. In 2012 he moved to the Netherlands to pursue a PhD in mechanical engineering at TU Eindhoven. Driven by an interest in the insights and predictive power of data, he shifted to the broad field of data science in 2016. As a data scientist he has worked primarily in financial services. He joined Van Lanschot Kempen in 2018 as part of the data science team, where he focuses primarily on investments and asset management.

Presentations

Using Machine Learning for Stock Picking Session

In this talk we describe how machine learning revolutionized the stock-picking process for portfolio managers at Kempen Capital Management by filtering the vast small-cap investment universe down to a handful of optimal stocks.

Mark Donsky leads product management at Okera, a software company that provides discovery, access control, and governance at scale for today's modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera. Mark has held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive briefing: big data in the era of heavy worldwide privacy regulations Session

The General Data Protection Regulation (GDPR) went into effect in May 2018 for firms doing any business in the EU, yet many companies aren't prepared for its strict requirements or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session explores the capabilities your data environment needs in order to simplify GDPR compliance and prepare for future regulations.

Getting ready for GDPR and CCPA: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to CCPA.

Ted Dunning is chief application architect at MapR. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects, and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He has contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Report Card on Streaming Microservices Session

As a community, we have been pushing streaming architectures, particularly microservices, for several years now. But what are the results in the field? I will describe several (anonymized) case histories, covering the good, the bad, and the ugly. In particular, I will describe how several teams that were new to big data fared by skipping MapReduce and jumping straight into streaming.

Ananth Packkildurai is a senior data engineer at Slack, where he manages core data infrastructure such as Airflow, Kafka, Flink, and Pinot. He is passionate about all things related to ethical data management and data engineering.

Presentations

Reliable logging infrastructure @ Slack Session

Logs are everywhere; every organization collects tons of data every day. But logs are only as good as the trust they earn to make business-critical decisions, and building that trust and reliability is critical to creating a data-driven organization. Ananth walks through his experience building reliable logging infrastructure at Slack and how it helped build confidence in the data.

Yoav drives product management, technology vision, and go-to-market activities for GigaSpaces. Prior to joining GigaSpaces, Yoav held various leading product management roles at Iguazio and Qwilt, mapping product strategy and roadmaps while providing technical leadership on architecture and implementation. Yoav brings more than 12 years of industry knowledge in product management and software engineering from high-growth software companies. An entrepreneur at heart, Yoav drives innovation and product excellence and successfully aligns them with market trends and business needs. Yoav holds a BSc in Computer Science and Business from Tel Aviv University, magna cum laude, and an MBA in Finance from the Leon Recanati School at Tel Aviv University.

Presentations

A Deep Learning Approach to Automatic Call Routing Session

Technological advancements are transforming customer experience, and businesses are beginning to benefit from Deep Learning innovations to automate call center routing to the most appropriate agent. This session will discuss how Deep Learning models can be run with Intel BigDL and Spark frameworks co-located on an in-memory computing platform to enhance the customer experience without the need for GPUs.

As CTO, Geir leads the R&D department in developing the Cognite industrial IoT data platform. Geir was founder and CEO/CTO at Snapsale, a machine learning classifieds startup that was acquired by Schibsted. Prior to this, he spent three years as a senior software engineer at Google in Canada, working on machine learning for AdWords and AdSense, which resulted in the Conversion Optimizer product. Geir has an MSc in computational science from the University of Oslo and won a silver medal at the International Olympiad in Informatics.

Presentations

Processing 10M samples/second to drive smart maintenance in complex IIoT systems Session

Learn how Cognite is developing IIoT smart maintenance systems that can process 10M samples/second from thousands of sensors. We'll review an architecture designed for high-performance, robust streaming sensor data ingest and cost-effective storage of large volumes of time series data, along with best practices for aggregation and fast queries and for achieving high performance with machine learning.
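A common ingredient in such architectures is pre-aggregation: rolling raw samples up into fixed time windows so queries touch aggregates rather than billions of points. A minimal sketch (window size and record layout are assumptions here, not Cognite's implementation):

```python
# Pre-aggregating a raw sensor stream into fixed time windows (window
# size and record layout are assumptions for this sketch).
from collections import defaultdict

WINDOW_MS = 1_000  # roll raw samples up into 1-second buckets

def aggregate(samples):
    """samples: iterable of (timestamp_ms, value).
    Returns {window_start_ms: [count, min, max, sum]}."""
    buckets = defaultdict(lambda: [0, float("inf"), float("-inf"), 0.0])
    for ts, value in samples:
        b = buckets[ts // WINDOW_MS * WINDOW_MS]
        b[0] += 1
        b[1] = min(b[1], value)
        b[2] = max(b[2], value)
        b[3] += value
    return buckets

stream = [(0, 1.0), (250, 3.0), (990, 2.0), (1500, 10.0)]
agg = aggregate(stream)
count, lo, hi, total = agg[0]
print(count, lo, hi, total / count)  # 3 1.0 3.0 2.0
```

Storing count/min/max/sum per window lets min, max, average, and coarser rollups all be answered without revisiting the raw samples.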

Moty Fania is a principal engineer for big data analytics at Intel IT and the CTO of the Advanced Analytics Group, which delivers big data and AI solutions across Intel. With over 15 years of experience in analytics, data warehousing, and decision support solutions, Moty leads the development and architecture of various big data and AI initiatives, such as IoT systems, predictive engines, online inference systems, and more. Moty holds a bachelor’s degree in economics and computer science and a master’s degree in business administration from Ben-Gurion University.

Presentations

Building a Sales AI platform – key principles and lessons learned Session

In this session, Moty Fania shares his experience implementing a Sales AI platform that handles the processing of millions of website pages and sifts through millions of tweets per day. The platform is based on unique open source technologies and was designed for real-time data extraction and actuation. This session highlights the key learnings with a thorough review of the architecture.

Fabio Ferraretto is the Lead for Data Science at Accenture in Latin America, where he manages 215 creative, innovative, and passionate data scientists. He applies advanced analytics, optimization, combinatorial math, and predictive and artificial intelligence techniques to solve complex and challenging business problems for clients in healthcare, telecom, CPG, mining, and other industries. He holds a degree in civil engineering from Escola Politecnica at USP and has worked at Accenture since 2002, applying analytics to business challenges. He led the Gapso Analytics acquisition in 2015 and its integration with Accenture.

Presentations

Insightful Health - Amplifying Intelligence in Healthcare Patient Flow Execution Session

How Albert Einstein and Accenture evolved the patient flow experience and efficiency with the use of applied AI, statistics, and combinatorial math, allowing the hospital to anticipate E2E visibility within patient flow operations, from admission of emergency and elective demands to assignment and medical releases.

Ilan Filonenko is a member of the Data Science Infrastructure team at Bloomberg, where he has designed and implemented distributed systems at both the application and infrastructure level. He is one of the principal contributors to Spark on Kubernetes, primarily focusing on the effort to enable secure HDFS interaction and non-JVM support. Previously, Ilan was an engineering consultant and technical lead in various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan currently researches algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on optimizing stochastic algorithms and model management.

Presentations

Cross-Cloud Model Training & Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We'll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud, then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.

Piotr Findeisen is a Software Engineer and a founding member of the Starburst team. He contributes to the Presto code base and is also active in the community. Piotr has been involved in the design and development of significant features like the cost-based optimizer (still in development), spill to disk, correlated subqueries and a plethora of smaller enhancements.

Before Starburst, Piotr worked at Teradata and was the top external Presto committer of the year. Prior to that, he was a Team Leader at Syncron (provider of cloud services for supply chain management), responsible for their product’s technical foundation and performance.

Piotr holds an MS in Computer Science and a BSc in Mathematics from the University of Warsaw.

Presentations

Presto: Cost-Based Optimizer for interactive SQL-on-Anything Session

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3/Azure ADLS, RDBMS, NoSQL, etc.). Starburst recently contributed a cost-based optimizer for Presto that brings a great performance boost. Learn about the CBO's internals, the motivating use cases, and the observed improvements.

Marcel has worked as a software engineer on the analytics team of the Wikimedia Foundation since October 2014. He believes it's a privilege to be able to contribute professionally to Wikipedia and the free knowledge movement. He has also worked on quite disparate things such as recommender systems, serious games, natural language processing, and… selling hand-painted t-shirts on the beach of Natal, Brazil.

Presentations

The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit Session

Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data for more than 90 days. The Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both.

Michael Freeman is a Senior Lecturer at the Information School at the University of Washington, where he teaches courses on data science, data visualization, and web development. With a background in public health, Michael works alongside research teams to design and build interactive data visualizations to explore and communicate complex relationships in large datasets. Previously, he was a data visualization specialist and research fellow at the Institute for Health Metrics and Evaluation, where he performed quantitative global health research and built a variety of interactive visualization systems to help researchers and the public explore global health trends. Michael is interested in applications of data visualization to social change. He holds a master's degree in public health from the University of Washington. You can take a look at samples from his projects on his website.

Presentations

Visually Communicating Statistical and Machine Learning Methods Session

Statistical and machine learning techniques are only useful when they're understood by decision makers. While implementing these techniques is easier than ever, communicating about their assumptions and mechanics is not. In this session, participants will learn a design process for crafting visual explanations of analytical techniques and communicating them to stakeholders.

Brandy Freitas is a principal data scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a research physicist-turned-data scientist based in Boston, MA. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryoelectron microscopy data. Brandy is a National Science Foundation Graduate Research Fellow and a James Mills Pierce Fellow. She holds an undergraduate degree in physics and chemistry from the Rochester Institute of Technology and did her graduate work in biophysics at Harvard University.

Presentations

Executive Briefing: Analytics for Executives Session

Data science is an approachable field given the right framing, but practitioners and executives often describe opportunities using completely different languages. In this session, Harvard biophysicist-turned-data scientist Brandy Freitas will work with participants to develop context and vocabulary around data science topics to help them build a culture of data within their organizations.

Ellen Friedman is Principal Technologist for MapR Technologies. She is a committer on the Apache Drill and Apache Mahout projects and coauthor of a number of short books on big data topics including AI and Analytics in Production, Machine Learning Logistics, Streaming Architecture, the Practical Machine Learning series, and Introduction to Apache Flink. Ellen has been an invited speaker at Strata Data conferences, Big Data London, Berlin Buzzwords, Nike Tech Talks, the University of Sheffield Methods Institute and NoSQL Matters Barcelona. She holds a PhD in biochemistry.

Presentations

Executive Briefing: 5 Things Every Executive Should *Not* Know Session

A surprising fact of modern technology is that not knowing some things can make you better at what you do. This isn't just lack of distraction or being too delicate to face reality; it's about separation of concerns, with a techno flavor. In this talk I go through five things that best practices with emerging technologies and new architectures can give us ways to not know, and why that's important.

Matt is an engineering leader at Teradata working on the open source project Presto. Before Teradata, Matt worked on the team that architected and developed Hadapt's second-generation SQL-on-Hadoop query engine, and before Hadapt he worked on Vertica's distributed query optimizer. Matt holds a master's degree in computer science from Brown University.

Presentations

Learning Presto: SQL-on-Anything Tutorial

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL-on-Anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from gigabytes to petabytes. In this tutorial, attendees will learn Presto usage and best practices, with optional hands-on exercises.

Siddha Ganju is a data scientist at Deep Vision, where she works on building deep learning models and software for embedded devices. Siddha is interested in problems that connect natural languages and computer vision using deep learning. Her work ranges from visual question answering to generative adversarial networks to gathering insights from CERN’s petabyte scale data and has been published at top tier conferences like CVPR. She is a frequent speaker at conferences and advises the Data Lab at NASA. Siddha holds a master’s degree in computational data science from Carnegie Mellon University, where she worked on multimodal deep learning-based question answering. When she’s not working, you might catch her hiking.

Presentations

Deep learning on mobile Session

This session covers bringing the power of convolutional neural networks (CNNs) to memory- and power-constrained devices like smartphones and drones.

Marina Rose Geldard, more commonly known as Mars, is a technologist from Down Under in Tasmania. Entering the world of technology relatively late as a mature-age student, she has found her place in the world: an industry where she can apply her lifelong love of mathematics and optimization. When she is not busy being the most annoyingly eager student ever, she compulsively volunteers at industry events, dabbles in research, and serves on the executive committee for her state’s branch of the Australian Computer Society (ACS) as well as the AUC (http://auc.edu.au). She is currently writing ‘Practical Artificial Intelligence with Swift’, for O’Reilly Media, and working on machine learning projects to improve public safety through public CCTV cameras in her home town of Hobart.

Presentations

Science-Fictional User Interfaces Session

Science fiction has been showcasing complex, AI-driven (often AR or VR) interfaces for huge amounts of data for decades. As television, movies, and video games became more capable of visualising a possible future, the grandeur of these imagined science-fictional interfaces has increased. What can we learn from Hollywood UX? Is there a useful takeaway? Does sci-fi show the future of AI UX?

Oliver Gindele is Head of Machine Learning at Datatonic. He studied materials science at ETH Zurich and moved to London to obtain his PhD in computational physics from UCL. Oliver is passionate about using computer models to solve real-world problems, which led him to join Datatonic to create bespoke machine learning solutions. Working with clients in retail, finance, and telecommunications, Oliver applies deep learning techniques to tackle some of the most challenging use cases in these industries.

Presentations

Deep Learning for Recommender Systems Session

In the past few years, the success of Deep Learning has reached the realm of structured data, where neural networks have been shown to improve the effectiveness and predictive power of recommendation engines. This session gives a brief overview of such deep recommender systems and how they can be implemented in TensorFlow.

Zachary Glassman is a data scientist in residence at the Data Incubator. Zachary has a passion for building data tools and teaching others to use Python. He studied physics and mathematics as an undergraduate at Pomona College and holds a master’s degree in atomic physics from the University of Maryland.

Presentations

Hands-On Data Science with Python 2-Day Training

We will walk through all the steps - from prototyping to production - of developing a machine learning pipeline. We’ll look at data cleaning, feature engineering, model building/evaluation, and deployment. Students will extend these models into two applications from real-world datasets. All work will be done in Python.

Hands-On Data Science with Python (Day 2) Training Day 2

We will walk through all the steps - from prototyping to production - of developing a machine learning pipeline. We’ll look at data cleaning, feature engineering, model building/evaluation, and deployment. Students will extend these models into two applications from real-world datasets. All work will be done in Python.

Emily has over ten years of experience in scientific computing and engineering research and development. She has a background in mathematical analysis, with a focus on probability theory and numerical analysis. She is currently working in Python development, though she has a background that includes C#/.Net, Unity3D, SQL, and MATLAB. In addition, she has experience in statistics and experimental design, and has served as Principal Investigator in clinical research projects.

Presentations

Continuous Intelligence: Keeping your AI Application in Production Session

Machine learning can be challenging to deploy and maintain. Data change, and both models and the systems that implement them must be able to adapt. Any delay in moving models from research to production means leaving your data scientists' best work on the table. In this talk, we explore continuous delivery (CD) for AI/ML and examine case studies applying CD principles to data science workflows.
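The core of a CD gate for models can be sketched in a few lines: promote a retrained candidate only if it beats the production model on a holdout set by a margin. All models, data, and thresholds below are hypothetical stand-ins:

```python
# A minimal deployment gate for retrained models: promote the candidate
# only if it beats production on a holdout set by a margin. Models,
# data, and threshold below are hypothetical stand-ins.

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

production_model = lambda x: x > 0.50  # currently deployed
candidate_model = lambda x: x > 0.44   # freshly retrained

# Synthetic holdout: the true decision boundary is at 0.45.
holdout = [(x / 100, x / 100 > 0.45) for x in range(100)]

prod_acc = accuracy(production_model, holdout)
cand_acc = accuracy(candidate_model, holdout)

MARGIN = 0.01  # require a real improvement before promoting
should_deploy = cand_acc >= prod_acc + MARGIN
print(prod_acc, cand_acc, should_deploy)  # 0.95 0.99 True
```

In a real pipeline this check would run automatically on every retrain, which is exactly the CD principle: models move to production through the same gated, repeatable process as code.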

Sonal is the founder and CEO of Nube Technologies, a startup focused on big data preparation and analytics. Nube builds business applications for better decision making through better data. Nube's fuzzy matching product Reifier helps companies get a holistic view of enterprise data: by linking and resolving entities across various sources, it helps optimize the sales and marketing funnel, promotes enhanced security and risk management, and enables better consolidation and reporting of business data. Nube helps its customers build better and more effective models by ensuring that their underlying master data is accurate.

Presentations

Mastering Data with Spark and Machine Learning Session

Enterprise data on customers, vendors, products, etc., is siloed and represented differently in diverse systems, hurting analytics, compliance, regulatory reporting, and 360 views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. This talk covers a modern master data application using Spark, Cassandra, machine learning, and Elastic.
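The linking step at the heart of such a system can be illustrated with a toy fuzzy matcher built on the standard library (a sketch of the general technique with made-up records, not Reifier or the Spark application itself):

```python
# Toy record linkage with fuzzy string matching (standard library only;
# all records here are hypothetical).
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Customer names from two siloed systems.
crm = ["Acme Corporation", "Globex Inc", "Initech LLC"]
billing = ["ACME Corp.", "Initech", "Umbrella Co"]

# Link each billing record to its best CRM match above a threshold.
THRESHOLD = 0.6
links = {}
for b in billing:
    best = max(crm, key=lambda c: similarity(b, c))
    if similarity(b, best) >= THRESHOLD:
        links[b] = best

for source, target in links.items():
    print(f"{source!r} -> {target!r}")
```

A production system replaces the pairwise scan with blocking and distributed joins, and the edit-distance score with learned similarity models, but the shape of the problem is the same.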

Trevor Grant is a committer on Apache Mahout, a contributor on the Apache Streams (incubating), Apache Zeppelin, and Apache Flink projects, and an Open Source Technical Evangelist at IBM. In former roles he called himself a data scientist, but the term is so overused these days. He holds an MS in Applied Math and an MBA from Illinois State University. Trevor is an organizer of the newly formed Chicago Apache Flink Meetup and has presented at Flink Forward, ApacheCon, Apache Big Data, and other meetups nationwide.

Trevor was a combat medic in Afghanistan in 2009 and wrote an award-winning undergraduate thesis between missions. He has a dog, a cat, and a '64 Ford, and he loves them all very much.

Presentations

Cross-Cloud Model Training & Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We'll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud, then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.

Jay is a final-year student at King's College London studying computer science. She joined the Big Data Platform team at Hotels.com for her industrial placement year, where she spent time working with Apache Hive, modularization techniques for SQL, and mutation testing tools.

Presentations

Mutant Tests Too: The SQL Session

Hotels.com describes approaches for applying software engineering best practices to SQL-based data applications in order to improve maintainability and data quality. Using open source tools, we show how to build effective test suites for Apache Hive code bases. We also present Mutant Swarm, a mutation testing tool we've developed to identify weaknesses in tests and to measure SQL code coverage.

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Lyft Data Platform: Now and in the Future Session

Lyft's data platform is at the heart of Lyft's business. Decisions from pricing to ETA to business operations rely on it, and it powers the enormous scale and speed at which Lyft operates. In this talk, Mark Grover walks through the choices Lyft has made in developing and sustaining the data platform, why they were made, and what lies ahead.

Nischal HP is currently the VP of Engineering at omnius, a Berlin-based AI startup building AI products for the insurance industry. Previously, he was a cofounder and data scientist at Unnati Data Labs, where he built end-to-end data science systems in fintech, marketing analytics, event management, and the medical domain. Nischal is also a mentor for data science on Springboard. During his tenure at companies like Redmart and SAP, he architected and built software for ecommerce systems spanning catalog management, recommendation engines, sentiment analyzers, data crawling frameworks, intention mining systems, and gamification of technical indicators for algorithmic trading platforms. Nischal has conducted workshops in the field of deep learning and has spoken at a number of data science conferences, including O'Reilly Strata San Jose 2017, PyData London 2016, PyCon Czech Republic 2015, Fifth Elephant India (2015 and 2016), and Anthill Bangalore 2016. He is a strong believer in open source and loves to architect big, fast, and reliable AI systems. In his free time, he enjoys traveling with his significant other, music, and grokking the web.

Presentations

Deep Learning for Fonts Session

Deep learning has enabled massive breakthroughs in offbeat tracks and has enabled a better understanding of how an artist paints, how an artist composes music, and so on. As part of Nischal and Raghotham’s beloved project, Deep Learning for Humans, they want to build a font classifier and show the masses how fonts can be classified, and how and why two or more fonts are similar.

Christian Hidber has a PhD in computer algebra from ETH Zurich and did a postdoc at UC Berkeley, where he researched online data mining algorithms. He currently applies machine learning to industrial hydraulics simulation as part of a product with 7,000 installations in 42 countries.

Presentations

Reinforcement Learning: a Gentle Introduction & Industrial Application Session

Reinforcement learning (RL) autonomously learns complex processes such as walking, beating the world champion at Go, or flying a helicopter. No big datasets with the “right” answers are needed: the algorithms learn by experimenting. We show how and why RL works in an intuitive fashion and highlight how to apply it to an industrial hydraulics application with 7,000 clients in 42 countries.
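
The "learn by experimenting" loop can be made concrete with tabular Q-learning, the simplest RL algorithm. The toy corridor environment below is invented for illustration and has nothing to do with the session's hydraulics application.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)   # corridor states 0..4; actions: 0 = left, 1 = right

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
            a = rng.choice(ACTIONS) if rng.random() < eps else max(ACTIONS, key=lambda x: q[s][x])
            nxt, r, done = step(s, a)
            # Q-learning update: nudge toward reward plus discounted best future value.
            q[s][a] += alpha * (r + gamma * max(q[nxt]) - q[s][a])
            s = nxt
    return q

q = train()
# After training, every non-terminal state should prefer "right" (action 1).
policy = [max(ACTIONS, key=lambda x: q[s][x]) for s in range(N_STATES - 1)]
```

No labeled data appears anywhere: the agent discovers the goal purely through trial, error, and the reward signal.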

Mark Hinely, Esq., is Director of Regulatory Compliance at KirkpatrickPrice and a member of the Florida Bar, with 10 years of experience in data privacy, regulatory affairs, and internal regulatory compliance. His specific experiences include performing mock regulatory audits, creating vendor compliance programs and providing compliance consulting. He is also SANS certified in the Law of Data Security and Investigations.

As GDPR has become a revolutionary data privacy law around the world, Mark has become the resident GDPR expert at KirkpatrickPrice. He has led the GDPR charge through internal training, developing free, educational content, and performing gap analyses, assessments, and consulting services for organizations of all sizes.

Presentations

The Future of Data Privacy Law: It’s Getting Personal Session

Organizations across the globe are trying to determine whether GDPR applies to them. Now, it seems as though GDPR principles are headed to the US: in 2018 alone, more than ten states passed or amended consumer privacy and breach notification laws. Mark Hinely provides insight into current and future US data privacy laws and how they will impact organizations across the globe.

Ana Hocevar obtained her PhD in Physics before becoming a postdoctoral fellow at the Rockefeller University where she worked on developing and implementing an underwater touchscreen for dolphins. She has over 10 years of experience in physics and neuroscience research and over 5 years of teaching experience. Now she combines her love for coding and teaching as a Data Scientist in Residence at The Data Incubator.

Presentations

Machine Learning from Scratch in TensorFlow 2-Day Training

TensorFlow provides computational graphs with automatic parallelization across resources, an architecture that is ideal for implementing neural networks. This training introduces TensorFlow's capabilities in Python, moving from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow, with several hands-on applications.

Machine Learning from Scratch in TensorFlow (Day 2) Training Day 2

TensorFlow provides computational graphs with automatic parallelization across resources, an architecture that is ideal for implementing neural networks. This training introduces TensorFlow's capabilities in Python, moving from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow, with several hands-on applications.
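
As a flavor of what "building machine learning algorithms piece by piece" looks like before handing the work over to a framework, here is a sketch of a tiny two-layer network with hand-written backpropagation in plain NumPy. The architecture, hyperparameters, and XOR task are illustrative assumptions, not the course's actual material.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])   # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, losses = 0.5, []
for _ in range(20000):
    h = np.tanh(X @ W1 + b1)          # hidden layer, forward pass
    p = sigmoid(h @ W2 + b2)          # output probability
    losses.append(float(np.mean((p - y) ** 2)))
    # Backward pass: gradients of mean squared error, written out by hand.
    dp = 2 * (p - y) / len(X) * p * (1 - p)
    dW2, db2 = h.T @ dp, dp.sum(axis=0)
    dh = dp @ W2.T * (1 - h ** 2)     # back through tanh
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Frameworks like TensorFlow replace the hand-derived backward pass with automatic differentiation over the same kind of computational graph.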

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they could never before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

Protecting sensitive data in huge datasets: Cloud tools you can use Session

Before releasing a public dataset, practitioners need to thread the needle between utility and the protection of individuals. We will explore massive public datasets, taking you from theory to real life and showcasing newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity into the practical realm (with options such as removing, masking, and coarsening).
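
The k-anonymity concept is easy to state in code: a release is k-anonymous when every combination of quasi-identifiers occurs at least k times. The toy records and field names below are invented for illustration; real tooling additionally handles l-diversity, masking, and coarsening.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    # Group rows by their quasi-identifier combination; every group needs >= k members.
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

# Toy, already-coarsened records (invented values: masked ZIPs, bucketed ages).
records = [
    {"zip": "100**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "100**", "age": "20-29", "diagnosis": "cold"},
    {"zip": "200**", "age": "30-39", "diagnosis": "flu"},
]
```

Here the third record sits alone in its quasi-identifier group, so the table is 1-anonymous but not 2-anonymous; coarsening its ZIP or age further would fix that.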

Matthew Honnibal is the creator and lead developer of spaCy, one of the most popular libraries for Natural Language Processing. He has been publishing research on NLP since 2005, with a focus on syntactic parsing and other structured prediction problems. He left academia to start working on spaCy in 2014.

Presentations

Agile NLP workflows with spaCy and Prodigy Session

In this talk, I'll discuss "one weird trick" that can give your NLP project a better chance of success. The advice is this: avoid a "waterfall" methodology where data definition, corpus construction, modelling and deployment are performed as separate phases of work.

Christopher Hooi is the Deputy Director of Communications & Sensors at the Land Transport Authority of Singapore. He is passionate about harnessing big data innovations to address complex land transport issues. Since 2010, he has embarked on a long-term digital strategy with the main aim of achieving smart urban mobility in a fast-changing digital world. Central to this strategy is building and sustaining a land transport digital ecosystem through an extensive network of sensor feeds, analytical processes, and commuter outreach channels, synergistically put together to deliver a people-centred land transport system.

Presentations

Early Incident Detection using Fusion Analytics of Commuter-Centric Data Sources Session

The Fusion Analytics for Public Transport Event Response (FASTER) system provides a real-time advanced analytics solution for early warning of potential train incidents. Using novel fusion analytics of multiple data sources, FASTER harnesses the use of engineering and commuter-centric IoT data sources to activate contingency plans at the earliest possible time and reduce impact to commuters.

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Solving data cleaning and unification using human-guided machine learning Session

Last year, we covered two primary challenges in applying machine learning to data curation: entity consolidation and using probabilistic inference to suggest data repairs for identified errors and anomalies. This year, we'll cover these limitations in greater detail and explain why data unification projects commonly require human-guided machine learning and a probabilistic model.

Amir Issaei is a data science consultant at Databricks. He educates customers on how to leverage Databricks’ Unified Analytics Platform in machine learning (ML) projects and helps customers implement ML solutions and use advanced analytics to solve business problems. Before joining Databricks, he worked in American Airlines’ Operations Research department, where he supported the Customer Planning, Airport, and Customer Analytics groups. He received an MS in mathematics from the University of Waterloo and a Bachelor of Engineering Physics from the University of British Columbia.

Presentations

Large-Scale ML with MLflow, Deep Learning and Apache Spark 2-Day Training

The course covers the fundamentals of neural networks and how to build distributed Keras/TensorFlow models on top of Spark DataFrames. Throughout the class, you will use Keras, TensorFlow, Deep Learning Pipelines, and Horovod to build and tune models. You will also use MLflow to track experiments and manage the machine learning lifecycle. NOTE: This course is taught entirely in Python.

Large-Scale ML with MLflow, Deep Learning and Apache Spark (Day 2) Training Day 2

The course covers the fundamentals of neural networks and how to build distributed Keras/TensorFlow models on top of Spark DataFrames. Throughout the class, you will use Keras, TensorFlow, Deep Learning Pipelines, and Horovod to build and tune models. You will also use MLflow to track experiments and manage the machine learning lifecycle. NOTE: This course is taught entirely in Python.

Maryam Jahanshahi is a research scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD from the Icahn School of Medicine at Mount Sinai, where she studied molecular regulators of organ size control. Maryam’s long-term research goal is to reduce bias in decision making by using a combination of computation linguistics, machine learning, and behavioral economics methods.

Presentations

The Evolution of Data Science Skill Sets: An analysis using Exponential Family Embeddings Session

In this talk, I will discuss exponential family embeddings, methods that extend the idea behind word embeddings to other data types. I will describe how we used dynamic embeddings to understand how data science skill sets have transformed over the last three years, using our large corpus of job descriptions. The key takeaway is that these models can enrich analysis of specialized datasets.
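
The drift that dynamic embeddings surface can be summarized with a familiar quantity: each term gets one vector per time slice, and slices are compared by cosine similarity. The vectors below are made-up toy values, not output from the actual job-description corpus.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One vector per term per time slice (toy values for illustration only).
emb_2016 = {"data_scientist": np.array([0.9, 0.1, 0.2])}
emb_2019 = {"data_scientist": np.array([0.5, 0.6, 0.4])}

# Drift: how far the term's learned representation moved between slices.
drift = 1.0 - cosine(emb_2016["data_scientist"], emb_2019["data_scientist"])
```

Terms whose vectors stay put have stable usage; terms with large drift are the ones whose surrounding skill context has changed.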

Alejandro (Alex) Jaimes is Senior Vice President of AI and data science at Dataminr. His work focuses on mixing qualitative and quantitative methods to gain insights on user behavior for product innovation. Alex is a scientist and innovator with 15+ years of international experience in research leading to product impact at companies including Yahoo, KAIST, Telefónica, IDIAP-EPFL, Fuji Xerox, IBM, Siemens, and AT&T Bell Labs. Previously, Alex was head of R&D at DigitalOcean, CTO at AiCure, and director of research and video products at Yahoo, where he managed teams of scientists and engineers in New York City, Sunnyvale, Bangalore, and Barcelona. He was also a visiting professor at KAIST. He has published widely in top-tier conferences (KDD, WWW, RecSys, CVPR, ACM Multimedia, etc.) and is a frequent speaker at international academic and industry events. He holds a PhD from Columbia University.

Presentations

AI for Good at Scale in Real Time: Challenges in Machine Learning and Deep Learning Session

When emergency events occur, social signals and sensor data are generated. In this talk, I will describe how Machine Learning and Deep Learning are applied in processing large amounts of heterogeneous data from various sources in real time, with a particular focus on how such information can be used for emergencies and in critical events for first responders and for other social good use cases.

Dave Josephsen runs the telemetry engineering team at SparkPost. He thinks you’re pretty great.

Presentations

Schema On Read and the New Logging Way Session

This is the story of how SparkPost Reliability Engineering abandoned ELK for a DIY schema-on-read logging infrastructure. We share architectural details and tribulations from our _Internal Event Hose_ data ingestion pipeline project, which uses Fluentd, Kinesis, Parquet, and AWS Athena to make logging sane.
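
Schema-on-read, in miniature: events are stored raw, and structure is imposed only at query time, so new fields never force a pipeline change. This sketch uses JSON lines and plain Python purely for illustration; the pipeline described in the session does the same thing at scale with Fluentd, Kinesis, Parquet, and Athena.

```python
import io
import json

# Raw events land in storage exactly as emitted (here, a JSON-lines "file").
raw_log = io.StringIO(
    '{"ts": 1, "event": "bounce", "code": 550}\n'
    '{"ts": 2, "event": "delivery", "rcpt": "a@example.com"}\n'
)

def query(raw, predicate):
    # Structure is applied here, at read time; the storage layer never saw a schema.
    return [rec for rec in map(json.loads, raw) if predicate(rec)]

bounces = query(raw_log, lambda r: r.get("event") == "bounce")
```

Contrast with schema-on-write (the ELK approach), where every new field has to be mapped before ingestion.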

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Autoscaling Spark on Kubernetes Session

In the Kubernetes world, where declarative resources are a first-class citizen, running complicated workloads across distributed infrastructure is easy and processing big data workloads using Spark is common practice. We can finally look at constructing a hybrid system: running Spark in a distributed, cloud-native way. Join respective experts Kris Nova and Holden Karau for a fun adventure.

Cross-Cloud Model Training & Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is, and how we can use it to train and serve models across different cloud environments (and on-prem). We’ll have a script to do the initial set up work ready so you can jump (almost) straight into training a model on one cloud, and then look at how to set up serving in another cluster/cloud. We will start with a simple model w/follow up links.

Improving Spark Down Scaling: Or not throwing away all of our work Session

As more workloads move to “serverless”-like environments, the importance of properly handling downscaling increases.

Rohit Karlupia has been writing high-performance server applications ever since completing his Bachelor of Technology in Computer Science and Engineering at IIT Delhi in 2001. He has deep expertise in messaging, API gateways, and mobile applications, and his primary research interests are the performance and scalability of cloud applications. At Qubole, his primary focus is making big data as a service debuggable, scalable, and performant. His current work includes SparkLens (an open source Spark profiler), GC/CPU-aware task scheduling for Spark, and the Qubole Chunked Hadoop File System.

Presentations

Scalability-aware autoscaling of Spark applications Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. This talk covers (1) measuring the efficiency of autoscaling policies and (2) devising autoscaling policies that are more efficient in terms of latency and cost.

A software developer at Square working on building engaging experiences for its iOS Point of Sale application. Previously worked at Microsoft, building user experiences for Bing, Cortana, and Seeing AI.

Presentations

Deep learning on mobile Session

Bringing the power of convolutional neural networks (CNNs) to memory- and power-constrained devices like smartphones and drones.
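
One standard trick for fitting CNNs onto such devices is 8-bit weight quantization: store uint8 values plus a scale and offset, reconstructing approximate floats on the fly, for a 4x memory saving over float32. This is a generic linear-quantization sketch, not any particular mobile framework's API.

```python
import numpy as np

def quantize(w):
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0   # guard against a constant tensor
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    # Reconstruct approximate float weights from the compact representation.
    return q.astype(np.float32) * scale + lo

weights = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale, lo = quantize(weights)
max_err = float(np.abs(weights - dequantize(q, scale, lo)).max())
```

The reconstruction error is bounded by half a quantization step (scale / 2), which CNN accuracy typically tolerates well.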

Until recently, Arun Kejariwal was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Architecture and Algorithms for End-to-End Streaming Data Processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). This tutorial leads the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline for real-time data (messaging, compute, and storage), along with algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.
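
As a taste of the streaming algorithms involved, the classic Misra-Gries summary finds heavy hitters in one pass with bounded memory: with k counters, any item occurring more than n/k times in a stream of length n is guaranteed to survive. A minimal sketch, offered as illustration rather than the tutorial's own code:

```python
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything, dropping counters that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 60 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5
summary = misra_gries(stream, k=4)   # any item with count > 100/4 = 25 must survive
```

The counts are underestimates (by at most n/k each), which is the memory-accuracy trade-off that makes one-pass processing of unbounded streams feasible.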

Model serving via Pulsar Functions Session

In this talk, we walk the audience through an architecture whereby models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. Further, we describe how Pulsar Functions can be applied to support two example use cases, viz., sampling and filtering, and lead the audience through a concrete case study of the same.

Sequence-2-Sequence Modeling for Time Series Session

Recently, sequence-to-sequence (S2S) modeling has also been used for applications based on time series data. In this talk, we first overview S2S and its early use cases, then walk through how S2S modeling can be leveraged for real-time anomaly detection and forecasting.

Seonmin Kim is a senior data risk analyst at LINE where he is a key member of the Trust and Safety team that handles payment fraud and content abuse using data analytics. He has over 9 years of extensive experience in identifying fraud and abuse risk across various business domains. His primary focus is on AI and machine learning for payment fraud and abuse risk.

Presentations

How to mitigate mobile fraud risk by data analytics Session

Kim introduces activities that mitigate the risk of mobile payments through data analytics, drawing on actual case studies of mobile fraud along with tree-based machine learning, graph analytics, and statistical approaches.

Mikayla is a software engineer at Google on the Cloud Dataproc team. She helped launch Dataproc’s High Availability mode and the Workflow Templates API. She is currently working on improvements to shuffle and autoscaling.

Presentations

Improving Spark Down Scaling: Or not throwing away all of our work Session

As more workloads move to “serverless”-like environments, the importance of properly handling downscaling increases.

Cassie Kozyrkov is Google Cloud’s chief decision scientist. Cassie is passionate about helping everyone make better decisions through harnessing the beauty and power of data. She speaks at conferences and meets with leadership teams to empower decision makers to transform their industries through AI, machine learning, and analytics. At Google, Cassie has advised more than a hundred teams on statistics and machine learning, working most closely with research and machine intelligence, Google Maps, and ads and commerce. She has also personally trained more than 15,000 Googlers (executives, engineers, scientists, and even nontechnical staff members) in machine learning, statistics, and data-driven decision making. Previously, Cassie spent a decade working as a data scientist and consultant. She is a leading expert in decision science, with undergraduate studies in statistics and economics at the University of Chicago and graduate studies in statistics, neuroscience, and psychology at Duke University and NCSU. When she’s not working, you’re most likely to find Cassie at the theatre, in an art museum, exploring the world, playing board games, or curled up with a good novel.

Presentations

Keynote with Cassie Kozyrkov Keynote

Cassie Kozyrkov

S.P.T. Krishnan, PhD, is a computer scientist and engineer with 18+ years of professional research and development experience in cloud computing, big data analytics, machine learning, and computer security.

He is recognized as a Google Developer Expert in Google Cloud Platform and is an authorized trainer for Google Cloud Platform. Red Hat selected him as Red Hat Certified Engineer of the Year. He has architect and developer experience on Amazon Web Services, Google Cloud Platform, OpenStack, and the Microsoft Azure Platform.

He authored the book Building Your Next Big Thing with Google Cloud Platform and has spoken at both Black Hat and RSA. He is also an adjunct faculty member in computer science and has taught 500+ university students over 5 years.

He is also a co-founder of Google Developer Group Singapore. He holds a PhD in Computer Engineering from the National University of Singapore, where he studied the performance characteristics of high-performance computing algorithms by evaluating them on different multiprocessor architectures.

Presentations

Using AWS Serverless Technologies to Analyze Large Datasets Tutorial

This tutorial provides an overview of the latest big data and machine learning serverless technologies from AWS and takes a deep dive into using them to process and analyze two different datasets: publicly available Bureau of Labor Statistics data and chest X-ray image data.

Mounia Lalmas is a Director of Research at Spotify, and the Head of Tech Research in Personalization. Mounia also holds an honorary professorship at University College London. Before that, she was a Director of Research at Yahoo, where she led a team of researchers working on advertising quality for Gemini, Yahoo’s native advertising platform. She also worked with various teams at Yahoo on topics related to user engagement in the context of news, search, and user-generated content. Her work focuses on studying user engagement in areas such as native advertising, digital media, social media, search, and now music. She has given numerous talks and tutorials on these and related topics. She is also the co-author of a book written as the outcome of her WWW 2013 tutorial on “measuring user engagement”.

Presentations

Music Recommendations (Research) at Spotify Session

Our mission is "to match fans and artists in a personal and relevant way". In this talk, Mounia describes some of the research work Spotify is doing to achieve this, from machine learning to metric validation, covering work done in the context of Home, Search, and Voice.

Francesca Lazzeri is an AI and machine learning scientist on the cloud developer advocacy team at Microsoft. Francesca has multiple years of experience as data scientist and data-driven business strategy expert; she is passionate about innovations in big data technologies and the applications of machine learning-based solutions to real-world problems. Her work on these issues covers a wide range of industries, including energy, oil and gas, retail, aerospace, healthcare, and professional services. Previously, she was a research fellow in business economics at Harvard Business School, where she performed statistical and econometric analysis within the Technology and Operations Management Unit and worked on multiple patent data-driven projects to investigate and measure the impact of external knowledge networks on companies’ competitiveness and innovation. Francesca is a mentor for PhD and postdoc students at the Massachusetts Institute of Technology and enjoys speaking at academic and industry conferences to share her knowledge and passion for AI, machine learning, and coding. Francesca holds a PhD in innovation management.

Presentations

Cross-Cloud Model Training & Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is, and how we can use it to train and serve models across different cloud environments (and on-prem). We’ll have a script to do the initial set up work ready so you can jump (almost) straight into training a model on one cloud, and then look at how to set up serving in another cluster/cloud. We will start with a simple model w/follow up links.

Time Series Forecasting with Azure Machine Learning service Tutorial

Time series modeling and forecasting is of fundamental importance in various practical domains, and during the past few decades, machine learning-based forecasting has become very popular in private- and public-sector decision making. In this tutorial, we walk you through the core steps of using Azure Machine Learning to build and deploy your time series forecasting models.
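
The core reduction behind ML-based forecasting is windowing: turn the series into (lag-window, next-value) pairs, then fit any regressor on them. The sketch below uses a plain least-squares fit on a synthetic seasonal signal purely for illustration; the tutorial itself works with Azure Machine Learning.

```python
import numpy as np

def make_windows(series, n_lags):
    # Each row of X holds the previous n_lags values; y holds the value that followed.
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

t = np.arange(120)
series = np.sin(2 * np.pi * t / 12)   # noiseless signal with a 12-step season

X, y = make_windows(series, n_lags=12)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # stand-in for any regressor
mse = float(np.mean((X @ coef - y) ** 2))
```

Because the window spans exactly one season, the regressor only needs to copy the lag-12 value, and the fit is essentially perfect; real series add noise, trend, and exogenous features on top of this same framing.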

Mike Lee Williams is a research engineer at Cloudera Fast Forward Labs, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Cloudera’s customers understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Federated learning: machine learning with privacy on the edge Session

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. In this talk we’ll cover the algorithmic solutions and the product opportunities.
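
The federated averaging idea can be sketched in a few lines: each device takes a gradient step on its private data, and only the resulting model updates, weighted by local sample counts, are averaged centrally; raw data never leaves the device. A toy linear-regression version, with all data and parameters invented for illustration:

```python
import numpy as np

def local_update(w, X, y, lr=0.1):
    # One least-squares gradient step on this device's private data.
    grad = 2 * X.T @ (X @ w - y) / len(X)
    return w - lr * grad

def federated_average(updates, sizes):
    # The server only ever sees model parameters, never raw examples.
    total = sum(sizes)
    return sum(u * (n / total) for u, n in zip(updates, sizes))

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
devices = []
for _ in range(5):
    X = rng.normal(size=(20, 2))
    y = X @ true_w + rng.normal(scale=0.01, size=20)
    devices.append((X, y))   # private: stays on the device

global_w = np.zeros(2)
for _ in range(200):
    updates = [local_update(global_w, X, y) for X, y in devices]
    global_w = federated_average(updates, [len(X) for X, _ in devices])
```

Production systems layer secure aggregation and differential privacy on top so that even the individual updates reveal little, but the communication pattern is the same.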

Sun is working in Equinor as an Leading Engineer within Enterprise Data Management. She holds 3 years of Data Management experience from the Norwegian Hydrographic office and 7 years drilling services experience prior to Data Management positions for Statoil since 2008. This includes advisory posistions since 2011 and member in the Blue Book Work Group and Diskos Well Committe. In Enterprise Data Management the focus is now shifted to the whole company. Sun holds an M.Sc in Petroleum GeoScience from NTNU in 1998.

Presentations

Architecting a data platform to support analytic workflows for scientific data Session

In Upstream Oil and Gas, a vast amount of the data requested for analytics projects is “scientific data” - physical measurements about the real world. Historically this data has been managed “library-style” in files - but to provide this data to analytics projects, we need to do something different. Sun and Jane discuss architectural best practices learned from their work with subsurface data.

An IT professional with 14 years of experience across different sectors and technologies, with strong expertise in defining and developing big data architectures for both batch and streaming processing.

Presentations

The vindication of big data: how Hadoop is used at Santander UK to defend privacy Session

Big data is usually regarded as a menace to data privacy. However, with the right principles and mindset, it can be a game changer that puts customers first and treats data privacy as an inalienable right. Santander UK applied this model to comply with GDPR, using graph technology, Hadoop, Spark, and Kudu to drive data obscuring and data portability and to support machine learning exploration.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience and enjoys intelligent design and engaging storytelling. He is passionate about data, music, and nature.

Presentations

Building a Serverless Big Data Application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this workshop, we show you how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You will build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Building a Serverless Big Data Application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this workshop, we show you how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You will build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

David Low is currently the co-founder and chief data scientist at Pand.ai, which builds AI-powered chatbots to disrupt and shape the booming conversational commerce space with deep natural language processing. He represented Singapore and the National University of Singapore (NUS) in the Data Science Game 2016 in France and clinched the top spot among teams from Asia and the Americas. Recently, David has been invited as a guest lecturer by NUS to conduct masterclasses on applied machine learning and deep learning topics. Prior to Pand.ai, he was a data scientist with the Infocomm Development Authority (IDA) of Singapore.

Throughout his career, David has engaged in data science projects across the manufacturing, telco, ecommerce, and insurance industries. Some of his work, including sales forecast modeling and influencer detection, won him awards in several competitions and was featured on the IDA website and in NUS publications. Earlier in his career, David was involved in research collaborations with Carnegie Mellon University (CMU) and the Massachusetts Institute of Technology (MIT) on separate projects funded by the National Research Foundation and SMART. As a pastime, he competed on Kaggle and achieved a top 0.2% worldwide ranking.

Presentations

The Unreasonable Effectiveness of Transfer Learning on NLP Session

Transfer learning has proven a tremendous success in the computer vision field as a result of the ImageNet competition. In the past months, the natural language processing field has witnessed several breakthroughs with transfer learning, namely ELMo, the OpenAI Transformer, and ULMFiT. In this talk, David showcases the use of transfer learning in NLP applications with state-of-the-art (SOTA) accuracy.
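
The mechanic shared by ELMo, the OpenAI Transformer, and ULMFiT can be stripped to its essence: freeze a pretrained encoder and train only a small head on the target task. In this sketch the "encoder" is just a fixed random projection standing in for real pretrained features, and the data is synthetic; it illustrates the freeze-and-fine-tune pattern, not the talk's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = 0.2 * rng.normal(size=(8, 16))   # "pretrained" weights: never updated below

def encode(X):
    # Frozen feature extractor standing in for a pretrained language model.
    return np.tanh(X @ W_frozen)

X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # tiny labeled target task

feats = encode(X)                 # computed once; gradients never flow into W_frozen
head = np.zeros(feats.shape[1])   # only this small head is trained

for _ in range(2000):
    p = 1 / (1 + np.exp(-(feats @ head)))
    head -= 0.5 * feats.T @ (p - y) / len(y)   # logistic-regression gradient step

acc = float(np.mean((feats @ head > 0) == (y == 1)))
```

Because only the head's 16 parameters are learned, a few hundred labeled examples suffice; that label efficiency is the practical payoff of transfer learning.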

Feng Lu is a software engineer at Google and the tech lead and manager for Cloud Composer. He joined Google in 2014 after completing his PhD at UC San Diego, where his research was reported by MIT Technology Review, among others. He has a broad interest in cloud and big data analytics.

Presentations

Migrating Apache Oozie Workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. We see a need for an Oozie-to-Airflow workflow mapping as part of creating an effective cross-cloud/cross-system solution. This talk introduces an open source Oozie-to-Airflow migration tool developed at Google.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Over his career, he has been responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on Machine Learning with Kafka-based Streaming Pipelines Tutorial

This hands-on tutorial examines the production use of ML in streaming data pipelines, including periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and Spark ML, performance considerations, model metadata tracking, and other techniques.

Angie Ma is cofounder and COO of ASI Data Science, a London-based AI tech startup that offers data science as a service. ASI has completed more than 120 commercial data science projects across multiple industries and sectors and is regarded as the EMEA-based leader in data science. Angie is passionate about real-world applications of machine learning that generate business value for companies and organizations and has experience delivering complex projects from prototyping to implementation. A physicist by training, Angie was previously a researcher in nanotechnology, working on developing optical detection for medical diagnostics.

Presentations

AI for managers 2-Day Training

Angie Ma and Jonny Howell offer a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

AI for managers (Day 2) Training Day 2

Angie Ma offers a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Swetha Machanavajhala is a software engineer for Azure Networking at Microsoft, where she builds tools to help engineers detect and diagnose network issues within seconds. She is very passionate about building products and awareness for people with disabilities and has led several related projects at hackathons, driving them from idea to reality to launching as a beta product and winning multiple awards. Swetha is a co-lead of the Disability Employee Resource Group, where she represents the community of people who are deaf or hard of hearing, and is a part of the ERG chair committee. She is also a frequent speaker at both internal and external events.

Presentations

Inclusive Design: Deep Learning on audio in Azure, identifying sounds in real-time. Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people worldwide who are deaf or hard of hearing. We will explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning to audio in Azure.

Mark Madsen is the global head of architecture at Think Big Analytics, where he is responsible for understanding, forecasting, and defining the analytics landscape and architecture. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning and vendors on product management. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

I am a data architect and data scientist with 13+ years of experience in building extremely large data warehouses and analytical solutions. I have worked extensively on Hadoop, DI and BI tools, data mining and forecasting, data modeling, master and metadata management, and dashboard tools. I am proficient in Hadoop, SAS, R, Informatica, Teradata, and QlikView. I participate in Kaggle data mining competitions as a hobby.

Presentations

Scaling Impala - Common Mistakes and Best Practices Session

Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala is able to handle hundreds of nodes and tens of thousands of queries hourly. In this talk, we will discuss how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and antipatterns for end users and BI applications.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and HearthStone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Foundations for Successful Data Projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation we’ll provide guidance and best practices, from planning to implementation, based on years of experience working with companies to deliver successful data projects.

Mastering Streams & Pipelines: Designing and supporting the nervous system of your company Session

In the world of data, it is all about building the best path to support time and quality to value: 80% to 90% of the work is getting the data into the hands and tools that can create value. This talk takes us on a journey through different patterns and solutions that can work at the largest of companies.

Sundeep is SVP of product development at Gramener, a data science company, where he leads a team of data enthusiasts who tell visual stories of insights from analysis. These are built on Gramex, Gramener’s data-science-in-a-box platform. Previously, Sundeep worked at Comcast Cable, NeoTech Solutions, Birlasoft Inc, and Wipro Technologies, and as a consultant for federal agencies in the US and India. He holds a bachelor’s in electrical engineering and an MBA in IT and marketing.

Presentations

India's Data Dilemma with India Stack Session

Answering the simple question of what rights Indian citizens have over their data is a nightmare, and the rollout of solutions based on India Stack technology has added fuel to the fire. Sundeep explains, with on-the-ground examples, how businesses and citizens are navigating the India Stack ecosystem while dealing with data privacy, security, and ethics in India’s booming digital economy.

James Malone is a product manager for Google Cloud Platform and manages Cloud Dataproc and Apache Beam (incubating). Previously, James worked at Disney and Amazon. James is a big fan of open source software because it shows what is possible when people come together to solve common problems with technology. He also loves data, amateur radio, Disneyland, photography, running, and Legos.

Presentations

Migrating Apache Oozie Workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. We see a need for Oozie-to-Airflow workflow mapping as part of creating an effective cross-cloud, cross-system solution. This talk introduces an open source Oozie-to-Airflow migration tool developed at Google.

Founder, CEO and CTO.

David is a serial entrepreneur whose most recent company, Hexatier/GreenSQL, was acquired by Huawei. He was a founder of Precos, Vanadium-soft, GreenCloud, Teridion, Terrasic, and Re-Sec, among others.

Previously a director in Fortinet’s CTO office, he managed information security at Bezeq, the Israeli Telecom.

He has 24 years’ experience in leadership, AI, cybersecurity, development, and networking and is a veteran of an elite IDF unit.

He was named one of the top 40 Israeli internet startup professionals by TheMarker magazine and one of the top 40 under 40 most promising Israeli business professionals by Globes magazine.

David holds a master’s in computer science from Open University.

Presentations

Signal Processing, Machine Learning & Video Tell the Truth Session

The combination of a mere few minutes of video, signal processing, remote heart rate monitoring, machine learning, and data science can identify a person’s emotions, health condition, and performance. Financial institutions and potential employers can analyze whether you have good or bad intentions.

Jane McConnell is a practice partner for oil and gas within Teradata’s Industrial IoT Group, where she shows oil and gas clients how analytics can provide strategic advantage and business benefits in the multimillions. Jane is also a member of Teradata’s IoT core team, where she sets the strategy and positioning for Teradata’s IoT offerings and works closely with Teradata Labs to influence development of products and services for the industrial space. Originally from an IT background, Jane has also done time with dominant oil industry market players such as Landmark and Schlumberger in R&D, product management, consulting, and sales. In one role or another, she has influenced information management projects for most major oil companies across Europe. She chaired the education committee for the European oil industry data management group ECIM, has written for Forbes, and regularly presents internationally at oil industry events. Jane holds a BEng in information systems engineering from Heriot-Watt University in the UK. She is Scottish and has a stereotypical love of single malt whisky.

Presentations

Architecting a data platform to support analytic workflows for scientific data Session

In Upstream Oil and Gas, a vast amount of the data requested for analytics projects is “scientific data” - physical measurements about the real world. Historically this data has been managed “library-style” in files - but to provide this data to analytics projects, we need to do something different. Sun and Jane discuss architectural best practices learned from their work with subsurface data.

Darragh is a Solution Architect at Kainos, specialising in data engineering. He has been working with data-intensive systems for over a decade and was the founder of Kainos’ Data & Analytics Capability in 2014. He is Kainos’ lead architect for NewDay’s AWS Data Platform. He enjoys working with talented people and like every engineer, loves a technical challenge. In his spare time he is usually up a mountain or in a squash court but lately has developed an unhealthy fascination with unsolved crimes.

Presentations

Cloud-based streams and batches in the PCI - Transforming a Financial Services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up, on AWS. Session

In this session you will learn how we built a high-performance, contemporary data processing platform from the ground up on AWS. We will discuss our journey from a legacy, onsite, traditional data estate to an entirely cloud-based, PCI DSS-compliant platform.

Michael McCune is a software developer in Red Hat’s Emerging Technology Group, where he develops and deploys applications for cloud platforms. He is an active contributor to several radanalytics.io projects and a core reviewer for the OpenStack API working group. Previously, Michael developed Linux-based software for embedded global positioning systems.

Presentations

Application intelligence: bridging the gap between human expertise and machine learning Session

Artificial intelligence and machine learning are now popularly used terms, but how do we make use of these techniques without throwing away the valuable knowledge of experienced employees? This session delves into this idea with examples of how distributed machine learning frameworks fit together naturally with business rules management systems.

Hussein joined Google in October 2017 to relaunch the Cloud AI platform products, which include Cloud ML Engine, Kubeflow, and more to come. Prior to Google, Hussein worked at Facebook from 2012 to 2017, where he founded Facebook’s AI platform and applied ML teams, which built critical AI solutions and systems for News Feed, Ads, Instagram, WhatsApp, Messenger, and many other Facebook products.

Prior to Facebook, Hussein worked on search and speech at Bing, Microsoft, and received a master’s in speech recognition from the University of Cambridge.

Presentations

Mass production of AI solutions Session

AI will change how we live in the next 30 years. However, AI is still limited to a small group of companies, because building AI systems is expensive and difficult. To scale the impact of AI across the globe, we need to reduce the cost of building AI solutions. How can we do that? Can we learn from other industries? Yes, we can: the automobile industry went through a similar cycle.

A big data engineer at the Nielsen Marketing Cloud, I specialize in research and development of solutions for big data infrastructures using cutting-edge technologies such as Spark, Kafka, and Elasticsearch.

Presentations

Nielsen Presents: Fun with Kafka, Spark and Offset Management Session

Ingesting billions of events per day into our big data stores, we need to do it in a scalable, cost-efficient, and consistent way. When working with Spark and Kafka, the way you manage your consumer offsets has major implications for data consistency. We will go in depth on the solution we ended up implementing and discuss the working process and the dos and don'ts that led us to its final design.
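The consistency concern at the heart of this session can be illustrated with a toy Python sketch (this is not Nielsen's actual implementation; all names and data are illustrative): commit consumer offsets only after a batch is durably written, so that a crash replays uncommitted records rather than losing them.

```python
# Toy simulation of at-least-once offset management (illustrative only).
# Offsets are committed only AFTER the batch is durably written, so a
# crash between the write and the commit causes a replay, never data loss.

class OffsetStore:
    """Stands in for Kafka's committed-offset storage."""
    def __init__(self):
        self.committed = 0  # next offset to read on restart

def consume_batch(log, store, batch_size, sink, fail_before_commit=False):
    """Read from the last committed offset, write to the sink, then commit."""
    start = store.committed
    batch = log[start:start + batch_size]
    sink.extend(batch)                     # durable write first...
    if fail_before_commit:
        raise RuntimeError("crash after write, before commit")
    store.committed = start + len(batch)   # ...commit the offset last

log = [f"event-{i}" for i in range(10)]
store, sink = OffsetStore(), []

consume_batch(log, store, 4, sink)            # events 0-3 written and committed
try:
    consume_batch(log, store, 4, sink, fail_before_commit=True)
except RuntimeError:
    pass                                      # events 4-7 written but NOT committed
consume_batch(log, store, 4, sink)            # restart: replays events 4-7

print(store.committed)        # 8
print(sink.count("event-4"))  # 2 -> a duplicate, hence "at-least-once"
```

In a real Spark/Kafka pipeline the same principle applies: persist the batch to the data store first, then commit offsets back to Kafka or an external store; idempotent or deduplicated writes can then upgrade at-least-once delivery to effectively exactly-once.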

Cameron is a senior computer science student at Truman State University in Missouri. He is currently a research intern at Google on the Cloud Composer team and has had two previous internships there. He has a passion for open source projects, with a recent interest in Apache Airflow and Apache Oozie.

Presentations

Migrating Apache Oozie Workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. We see a need for Oozie-to-Airflow workflow mapping as part of creating an effective cross-cloud, cross-system solution. This talk introduces an open source Oozie-to-Airflow migration tool developed at Google.

Robin is a Developer Advocate at Confluent, the company founded by the creators of Apache Kafka, as well as an Oracle Developer Champion and ACE Director Alumnus. His career has always involved data, from the old worlds of COBOL and DB2, through the worlds of Oracle and Hadoop, and into the current world with Kafka. His particular interests are analytics, systems architecture, performance testing and optimization. He blogs at http://cnfl.io/rmoff and http://rmoff.net/ (and previously http://ritt.md/rmoff) and can be found tweeting grumpy geek thoughts as @rmoff. Outside of work he enjoys drinking good beer and eating fried breakfasts, although generally not at the same time.

Presentations

Real-time SQL Stream Processing at Scale with Apache Kafka and KSQL Tutorial

In this workshop you will learn the architectural reasoning for Apache Kafka and the benefits of real-time integration, and then build a streaming data pipeline using nothing but your bare hands, Kafka Connect, and KSQL.

The Changing Face of ETL: Event-Driven Architectures for Data Engineers Session

This talk discusses the concepts of events, their relevance to software and data engineers and their ability to unify architectures in a powerful way. It describes why analytics, data integration and ETL fit naturally into a streaming world. There'll be a hands-on demonstration of these concepts in practice and commentary on the design choices made.

Ines is a developer specialising in applications for AI technology. She’s the co-founder of Explosion AI and a core developer of spaCy, one of the most popular libraries for Natural Language Processing, and Prodigy, an annotation tool for radically efficient machine teaching.

Presentations

Practical NLP transfer learning with spaCy and Prodigy Scale Session

In this talk, I'll explain spaCy's new support for efficient and easy transfer learning, and show you how it can kickstart new NLP projects with our new annotation tool, Prodigy Scale.

I have worked for the past six years as an engineer on various Adobe Marketing Cloud solutions, where I got to experiment with mobile, video, and backend development.
When dealing with Adobe applications serving 23 billion requests per day, some serious muscles need to be flexed. To make it possible to deploy new versions of our backend applications while the plane is flying, we need extremely precise and reliable tools that work fast and with minimal human intervention. This is the area my team focuses on, offering infrastructure automation and fast deployments in Adobe Audience Manager.
Outside business hours, I love playing pool and enjoy a good book.

Presentations

Deploying your realtime apps on thousands of servers and still being able to breathe Session

Obtaining servers to run your realtime application has never been easier: cloud providers have removed the cumbersome process of provisioning new hardware to suit your needs. What happens, though, when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast and reliable way with minimal human intervention? This session addresses precisely that topic.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Previously, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive.

Presentations

Running SQL-based workloads in the cloud at 20x-200x Lower Cost Using Apache Arrow Session

Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. We look at TPC workloads and how they can be accelerated, invisibly to client apps. We explore how Apache Arrow, Parquet, and Calcite can be used to provide a scalable, high-performance solution optimized for cloud deployments while significantly reducing operational costs.

Paco Nathan is known as a “player/coach”, with core expertise in data science, natural language processing, machine learning, cloud computing; 35+ years tech industry experience, ranging from Bell Labs to early-stage start-ups. Co-chair JupyterCon and Rev. Advisor for Amplify Partners, Deep Learning Analytics, Recognai, Data Spartan. Recent roles: Director, Learning Group @ O’Reilly Media; Director, Community Evangelism @ Databricks and Apache Spark. Cited in 2015 as one of the Top 30 People in Big Data and Analytics by Innovation Enterprise.

Presentations

Executive Briefing: Overview of Data Governance Session

Data governance is an almost overwhelming topic. This talk surveys its history and themes, along with tools, processes, and standards. Mistakes lead to data quality issues, lack of availability, and other risks that prevent an organization from leveraging its data; compliance efforts, on the other hand, aim to prevent the risks of leveraging data inappropriately. Ultimately, risk management is the thin edge of the wedge in the enterprise.

Dr Sami Niemi has been working on Bayesian inference and machine learning for over 10 years and has published peer-reviewed papers in astrophysics and statistics. He has delivered machine learning models for industries including telecommunications and financial services. Sami has built supervised learning models to predict customer and company defaults, first- and third-party fraud, and customer complaints, and has used natural language processing for probabilistic parsing and matching. He has also used unsupervised learning in a risk-based anti-money laundering application. Currently Sami works at Barclays, where he leads a team of data scientists building fraud detection models and manages the UK fraud models.

Presentations

Predicting Real-Time Transaction Fraud Using Supervised Learning Session

Predicting transaction fraud of debit and credit card payments in real time is an important challenge, which state-of-the-art supervised machine learning models can help to solve. Barclays has been developing and testing different solutions and will show how well different models perform in a variety of situations, such as card-present and card-not-present debit and credit card transactions.

Erik Nordström is a senior software engineer at Timescale, where he focuses on both core database and infrastructure services. Previously, he worked on Spotify’s backend service infrastructure and was a postdoc and research scientist at Princeton, where he focused on networking and distributed systems, including a new end-host network stack for service-centric networking. Erik holds an MSc and PhD from Uppsala in Sweden.

Presentations

Performant time-series data management and analytics with Postgres Session

Requirements of time-series databases include ingesting high volumes of structured data; answering complex, performant queries for both recent and historical time intervals; and performing specialized time-centric analysis and data management. Erik explains how one can avoid these operational problems by re-engineering Postgres to serve as a general data platform, including high-volume time-series workloads.
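As one hypothetical illustration of the "specialized time-centric analysis" mentioned above, consider bucketed aggregation, which TimescaleDB exposes in SQL as time_bucket(). The idea can be sketched in a few lines of Python (toy data, illustrative only):

```python
from collections import defaultdict

def time_bucket(width, ts):
    """Floor a timestamp (in seconds) to its bucket start, like SQL time_bucket()."""
    return ts - (ts % width)

# Illustrative (ts_seconds, value) sensor readings
readings = [(0, 1.0), (30, 3.0), (65, 5.0), (90, 7.0), (130, 2.0)]

# Group readings into one-minute buckets
buckets = defaultdict(list)
for ts, val in readings:
    buckets[time_bucket(60, ts)].append(val)

# Per-minute averages, analogous to:
#   SELECT time_bucket('1 minute', time), avg(value) FROM readings GROUP BY 1;
averages = {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
print(averages)  # {0: 2.0, 60: 6.0, 120: 2.0}
```

In a re-engineered Postgres this grouping runs inside the database over time-partitioned tables rather than in application code, which is what makes such queries performant at high ingest volumes.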

Kris Nova is a senior developer advocate at Heptio focusing on containers, infrastructure, and Kubernetes. She is also an ambassador for the Cloud Native Computing Foundation. Previously, Kris was a developer advocate and an engineer on Kubernetes in Azure at Microsoft. She has a deep technical background in the Go programming language and has authored many successful tools in Go. Kris is a Kubernetes maintainer and the creator of kubicorn, a successful Kubernetes infrastructure management tool. She organizes a special interest group in Kubernetes and is a leader in the community. Kris understands the grievances with running cloud-native infrastructure via a distributed cloud-native application and recently authored an O’Reilly book on the topic: Cloud Native Infrastructure. Kris lives in Seattle, WA, and spends her free time mountaineering.

Presentations

Autoscaling Spark on Kubernetes Session

In the Kubernetes world where declarative resources are a first class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice -- we can finally look at constructing a hybrid system of running Spark in a distributed cloud native way. Join respective experts Kris Nova & Holden Karau for a fun adventure.

Eoin is currently a lead data engineer at NewDay. For the past couple of years he has been part of NewDay’s digital transformation, specifically bringing in and enabling new data capabilities. He previously held several roles across data, IT, and architecture at the data analytics firm Dunnhumby.

Presentations

Cloud-based streams and batches in the PCI - Transforming a Financial Services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up, on AWS. Session

In this session you will learn how we built a high-performance, contemporary data processing platform from the ground up on AWS. We will discuss our journey from a legacy, onsite, traditional data estate to an entirely cloud-based, PCI DSS-compliant platform.

Brian O’Neill is the founder and consulting product designer at Designing for Analytics, where he focuses on helping companies design indispensable data products that customers love. Brian’s clients and past employers include Dell EMC, NetApp, TripAdvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JPMorgan Chase, the Future of Music Coalition, and E*TRADE, among others; over his career, he has worked on award-winning storage industry software for Akorri and Infinio. Brian has been designing useful, usable, and beautiful products for the web since 1996. He has also brought over 20 years of design experience to various podcasts, meetups, and conferences such as the O’Reilly Strata Conference in New York City and London, England. He is the author of the Designing for Analytics Self-Assessment Guide for Non-Designers as well as numerous articles on design strategy, user experience, and business related to analytics. Brian is also an expert advisor on the topics of design and user experience for the International Institute for Analytics. When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage performing as a professional percussionist and drummer. He leads the acclaimed dual-ensemble Mr. Ho’s Orchestrotica, which the Washington Post called “anything but straightforward,” and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival. If you’re at a conference, just look for the only guy with a stylish orange leather messenger bag.

Presentations

UX Strategies for Underperforming Enterprise Data Products and Analytics Services Session

Gartner says 85%+ of big data projects will fail, despite the fact that your company may have invested millions in engineering implementation. Why are customers and employees not engaging with these products and services? Brian O'Neill explains why a "people first, technology second" mission—a design strategy, in other words—enables the best UX and business outcomes possible.

Cait joined Shazam in November 2013 as VP of Product, Music and Platforms. She is responsible for their hugely successful mobile and web products as well as the music roadmap.
Cait joined Shazam from the BBC where she held a number of roles including Head of Product for Sport. Cait was responsible for the digital development for the BBC’s coverage of the 2012 Olympic Games across web, mobile and connected TV.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: From the Edge to AI - Taking Control of your Data for Fun and Profit Session

It's easier than ever to collect data, but managing it securely, in compliance with regulations and legal constraints, is harder. There are plenty of tools that promise to bring machine learning techniques to your data, but choosing the right tools and managing models and applications in compliance with regulation and law is quite difficult.

Laila is currently a lawyer practicing technology and privacy law at GTC Law Professional Corp. She is also a software applications engineer. She previously held positions at ExxonMobil and Capstone Technology where she designed and implemented machine learning (AI) software solutions to optimize industrial processes. She routinely advises both Fortune 100 and start-up clients on all aspects of the development and commercialization of their technology solutions (including big data/predictive modelling/machine learning) in diverse industries including fintech, healthcare, and the automotive industry. She is a steering committee member of the Toronto Machine Learning Symposium and will be a panel member discussing responsible AI innovation in November. She has spoken most recently at the Global Blockchain Conference (“Smart Contract Management & Innovation”), the Healthcare Blockchain in Canada conference (“How Blockchain Can Solve Healthcare Challenges”) and the Linux FinTech Forum (“Smart Money Bets on Open Source Adoption in AI/ML Fintech Applications”). Laila will be faculty for the upcoming Osgoode Certificate in Blockchains, Smart Contracts and the Law (November 2018). Laila holds a B.A.Sc. in Chemical Engineering from the University of Toronto, a M.A.Sc. in Chemical Engineering from the University of Waterloo and a J.D. from the University of Toronto, where she was a law review editor. She is admitted to practice in New York and Ontario. She is also a Certified Information Privacy Professional (Canada) (CIPP/C).

Presentations

Responsible AI Innovation Session

As companies commercialize novel applications of AI in areas such as finance, hiring, and public policy, there is concern that these automated decision-making systems may unconsciously duplicate social biases, with unintended societal consequences. This talk provides practical advice for companies to counteract such prejudices through a legal- and ethics-based approach to innovation.

Yves Peirsman is the founder and Natural Language Processing expert at NLP Town. Yves started his career as a PhD student at the University of Leuven and a post-doctoral researcher at Stanford University. Since he made the move from academia to industry, he has gained extensive experience in consultancy and software development for NLP projects in Belgium and abroad.

Presentations

Dealing with Data Scarcity in Natural Language Processing Session

In this age of big data, NLP professionals are all too often faced with a lack of data: written language is abundant, but labelled texts are much harder to come by. In my talk, I will discuss the most effective ways of addressing this challenge, from the semi-automatic construction of labelled training data to transfer learning approaches that reduce the need for labelled training examples.

Nick Pentreath is a principal engineer in IBM’s Center for Open-source Data & AI Technology (CODAIT), where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Building a Secure and Transparent ML Pipeline Using Open Source Technologies Session

The application of AI algorithms in domains such as criminal justice, credit scoring, and hiring holds unlimited promise. At the same time, it raises legitimate concerns about algorithmic fairness. There is a growing demand for fairness, accountability, and transparency from machine learning (ML) systems. In this talk we cover how to build just such a pipeline leveraging open source tools.

Dirk is Head of Engineering and Data Science at Zalando, Europe’s leading fashion platform. Trained as a data scientist, he enables his five development teams to revolutionize online marketing steering in a fully automated, ROI-driven, personalized way. In his spare time Dirk hacks functional Scala and reads through O’Reilly’s online library 10 books at a time.

Presentations

Insights from Engineering Europe's Largest Marketing Platform for Fashion Session

A case study from Zalando, Europe’s leading online fashion platform, about its journey to a scalable, personalized, machine learning-based marketing platform.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, Tom was an early employee at VMware; as senior staff engineer, he was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular pluggable storage architecture. He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit filesystem.

Presentations

Deep learning with TensorFlow and Spark using GPUs and Docker containers Session

Organizations need to keep ahead of their competition by using the latest AI/ML/DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. This session will discuss the effective deployment of such applications in a container environment.

Willem leads the Data Science Platform Team at GO-JEK. His main focus areas are building data and ML platforms, allowing organizations to scale machine learning and drive decision making.

The GO-JEK ML platform supports a wide variety of models and handles over 100 million orders every month. Models include recommendation systems, driver allocation, forecasting, anomaly detection, route selection, and more.

In a previous life Willem founded and sold a networking startup and worked as a software engineer in industrial control systems.

Presentations

Unlocking insights in AI by building a feature platform Session

Features are key to driving impact with AI at all scales. By democratizing the creation, discovery, and access of features through a unified platform, organizations are able to dramatically accelerate innovation and time to market. Find out how GO-JEK, Indonesia's first billion-dollar startup, built a feature platform to unlock insights in AI, and the lessons they learned along the way.

Dan is a site reliability engineer on the Adobe Audience Manager team, where he has lately focused on creating and deploying continuous delivery pipelines for applications within the project, dealing with all aspects of the automation process from instance provisioning to application deployments. He is passionate about technology and, more recently, programming in general, and loves playing video games.

Presentations

Deploying your realtime apps on thousands of servers and still being able to breathe Session

Obtaining servers to run your realtime application has never been easier: cloud providers have removed the cumbersome process of provisioning new hardware to suit your needs. But what happens when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast and reliable way with minimal human intervention? This session addresses precisely that topic.

Emma Prest is the general manager of DataKind UK, where she handles the day-to-day operations supporting the influx of volunteers and building understanding about what data science can do in the charitable sector. Emma sits on the Editorial Advisory Committee at the Bureau of Investigative Journalism. She was previously a program coordinator at Tactical Tech, providing hands-on help for activists using data in evidence-based campaigns. Emma holds an MA in public policy with a specialization in media, information, and communications from Central European University in Hungary and a degree in politics and geography from the University of Edinburgh, Scotland.

Presentations

Why is it so hard to do AI for Good? Session

DataKind UK has been working in data for good since 2013, working with over 100 UK charities to help them do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. In this session, Emma and Duncan will talk about how to identify the right data-for-good projects...

Greg is responsible for driving SQL product strategy as part of Cloudera’s data warehouse product team, including working directly with Impala. Over 20 years, Greg has worked with relational database systems across a variety of roles – including software engineering, database administration, database performance engineering, and most recently product management – providing a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

The Future of Cloud-native Data Warehousing: Emerging Trends and Technologies Session

Data warehouses have traditionally run in the data center and in recent years they have adapted to be more cloud-native. In this talk, we'll discuss a number of emerging trends and technologies that will impact how data warehouses are run both in the cloud and on-prem and share our vision on what that means for architects, administrators, and end users.

Vidya leads product management for Machine Learning at Cloudera. Prior to Cloudera, she helped build highly successful software portfolios in several industry verticals, ranging from telecom and healthcare to energy and IoT. Her experience spans early-stage startups, pre-IPO companies, and big enterprises. Vidya has a Masters in Business Administration from Duke University.

Presentations

Starting with the end in mind: learnings from data strategies that work Session

Not surprisingly, there is no single approach to embracing data-driven innovations within any industry vertical. However, there are some enterprises that are doing a better job than others when it comes to establishing a culture, process and infrastructure that lends itself to data-driven innovations. In this talk, we will share some key foundational ingredients that span multiple industries.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Architecture and Algorithms for End-to-End Streaming Data Processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). In this tutorial, we will lead the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline - messaging, compute, and storage - for real-time data, and through algorithms for extracting insights (e.g., heavy hitters and quantiles) from data streams.
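One of the example problems above, heavy hitters, admits a compact one-pass solution. The sketch below (a generic Python illustration, not material from the tutorial) uses the Misra-Gries summary: with at most k−1 counters, any item occurring more than n/k times in a stream of n items is guaranteed to survive as a candidate.

```python
# Misra-Gries heavy-hitters sketch: one pass over the stream, at most k-1 counters.
# Any item occurring more than n/k times survives as a candidate; reported counts
# undercount true frequencies by at most n/k.
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters  # candidate heavy hitters with (undercounted) frequencies

stream = list("aabacaabbcadaea")  # 15 items, 'a' appears 8 times (> 15/3)
print(misra_gries(stream, k=3))
```

A second pass over the stream (or a held-out count) is typically used to filter the candidates down to the true heavy hitters.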

Model serving via Pulsar Functions Session

In this talk, we will walk the audience through an architecture in which models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. We will then describe how Pulsar Functions can be applied to support two example use cases, sampling and filtering, and work through a concrete case study.

Duncan Ross is Chief Data Officer at Times Higher Education. Duncan has been a data miner since the mid-1990s. Previously at Teradata, Duncan created analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing and social network analysis in telecommunications. In his spare time, Duncan has been a city councillor, chair of a national charity, founder of an award-winning farmers’ market, and one of the founding directors of the Institute of Data Miners. More recently, he cofounded DataKind UK and regularly speaks on data science and social good.

Presentations

Using data for evil V: the AI strikes back Session

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis (and possibly AI). Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

Why is it so hard to do AI for Good? Session

DataKind UK has been working in data for good since 2013, working with over 100 UK charities to help them do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. In this session, Emma and Duncan will talk about how to identify the right data-for-good projects...

Nikki Rouda has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Before his current role at Amazon Web Services (AWS), Nikki held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IOT startup.) Nikki has an MBA from Cambridge’s Judge Business School, and a ScB in geophysics from Brown University.

Presentations

Executive Briefing: AWS Technology Trends - Data Lakes and Analytics Session

This talk is about some of the key trends we see in data lakes and analytics, and how they shape the services we offer at AWS. Specific topics include the rise of machine generated data and semi-structured/unstructured data as dominant sources of new data, the move towards serverless, SPI-centric computing, and the growing need for local access to data from users around the world.

Neelesh Srinivas Salian is a software engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. Previously, at Cloudera, he worked with Apache projects like YARN, Spark, and Kafka.

Presentations

How do you evolve your data infrastructure? Session

Developing data infrastructure is not trivial, and neither is changing it. It takes effort and discipline to make changes that can affect your team. In this talk, we'll share what we on Stitch Fix's Data Platform team do to maintain and innovate on our infrastructure for our data scientists.

Shioulin Sam is a research engineer at Cloudera Fast Forward Labs. In her previous life, she was an angel investor focusing on women-led startups. She also worked in the investment management industry designing quantitative trading strategies. She holds a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Presentations

Learning with Limited Labeled Data Session

Machine learning requires large datasets - a prohibitive limitation in many real world applications. What if we could build models from scratch that could recognize images using only a handful of labeled examples? In this talk, we will cover algorithmic solutions that enable learning with limited data, and discuss product opportunities.

Mark Samson is a principal systems engineer at Cloudera, helping customers solve their big data problems using modern data platforms based on Hadoop. Mark has 20 years’ experience working with big data and information management software in technical sales, service delivery, and support roles.

Presentations

Information Architecture for a Modern Data Platform Session

It is now possible to build a modern data platform capable of storing, processing and analysing a wide variety of data across multiple public and private Cloud platforms and on-premise data centres. This session will outline an information architecture for such a platform, informed by working with multiple large organisations who have built such platforms over the last 5 years.

Danilo Sato is a polyglot principal consultant with more than fifteen years of experience as an architect, data engineer, developer, and agile coach. Balancing strategy with execution, Danilo helps clients refine their technology strategy while adopting practices to reduce the time between having an idea, implementing it, and running it in production using cloud, DevOps and continuous delivery. Danilo authored DevOps In Practice: Reliable and Automated Software Delivery, is a member of ThoughtWorks’ Office of the CTO, and an experienced international conference speaker.

Presentations

Continuous Intelligence: Moving Machine Learning into Production Reliably Tutorial

In this workshop, we will present how to apply the concept of Continuous Delivery (CD) - which ThoughtWorks pioneered - to data science and machine learning. It allows data scientists to make changes to their models, while at the same time safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with a high frequency.

Max Schultze is a Data Engineer currently working on building a Data Lake at Europe’s biggest online fashion retailer, Zalando. His focus lies on building data pipelines at scale of terabytes per day and productionizing Spark and Presto as analytical platforms inside the company. He graduated from the Humboldt University of Berlin, actively taking part in the university’s initial development of Apache Flink.

Presentations

From legacy to cloud: an end to end data integration journey Session

A case study of data lake implementation at a large-scale company: raw data collection, standardized data preparation (e.g., binary conversion, partitioning), user-driven analytics, and machine learning.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Foundations for Successful Data Projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation, we’ll provide guidance and best practices, from planning to implementation, based on years of experience working with companies to deliver successful data projects.

Ben is a software engineer at Google on the Dataproc team, working to improve the experience of autoscaling with Spark.

Presentations

Improving Spark Down Scaling: Or not throwing away all of our work Session

As more workloads move to “serverless”-like environments, the importance of properly handling downscaling increases.

Rosaria Silipo is a Principal Data Scientist at KNIME. Rosaria holds a doctorate degree in bioengineering and has spent most of her professional life working on data science projects for a number of customer companies in different fields, such as IoT, customer intelligence, the financial industry, and cybersecurity.

Presentations

Practicing Data Science: A Collection of Case Studies Session

This is a collection of past data science projects. While the structure is often similar - data collection, data transformation, model training, deployment - each of them needed some special trick. The turning point in implementing each data science solution was either a change in perspective or a particular technique for dealing with a special case or a special business question.

Dr. Alkis Simitsis is Chief Scientist for Cybersecurity Analytics at Micro Focus. He has more than 15 years of experience in multiple roles building innovative information and data management solutions in areas like real-time business intelligence, security, massively parallel processing, systems optimization, data warehousing, graph processing, and web services. Alkis holds 26 U.S. patents and has filed over 50 patent applications in the U.S. and worldwide, has published more than 100 papers in refereed international journals and conferences (top publications cited 5000+ times), and frequently serves in various roles on program committees of top-tier international scientific conferences. He is an IEEE senior member and a member of ACM.

Presentations

A Magic 8-Ball for Optimal Cost and Resource Allocation for the Big Data Stack Session

Cost and resource provisioning are critical components of the big data stack. A magic 8-ball for the big data stack would give an enterprise a glimpse into its future needs and would enable effective and cost-efficient project and operational planning. This talk covers how to build that magic 8-ball, a decomposable time-series model, for optimal cost and resource allocation for the big data stack.

Rebecca Simmonds is a senior software engineer at Red Hat, where she is part of an emerging technology group comprising both data scientists and developers. She completed a PhD at Newcastle University, in which she developed a platform for scalable geospatial and temporal analysis of Twitter data. After this she moved to a small startup as a Java developer, creating solutions to improve performance for a CV analyser. She has a keen interest in architecture design and data analysis, which she is furthering at Red Hat with OpenShift and ML research.

Presentations

Application intelligence: bridging the gap between human expertise and machine learning Session

Artificial intelligence and machine learning are now popularly used terms, but how do we make use of these techniques without throwing away the valuable knowledge of experienced employees? This session will delve into this question, with examples of how distributed machine learning frameworks fit together naturally with business rules management systems.

Animesh Singh is an STSM and lead for IBM Watson and Cloud Platform, where he leads machine learning and deep learning initiatives on IBM Cloud and works with communities and customers to design and implement deep learning, machine learning, and cloud computing frameworks. He has a proven track record of driving the design and implementation of private and public cloud solutions from concept to production. In his decade-plus at IBM, Animesh has worked on cutting-edge projects for IBM enterprise customers in the telco, banking, and healthcare industries, particularly focusing on cloud and virtualization technologies, and led the design and development of the first IBM public cloud offering.

Presentations

Building a Secure and Transparent ML Pipeline Using Open Source Technologies Session

The application of AI algorithms in domains such as criminal justice, credit scoring, and hiring holds unlimited promise. At the same time, it raises legitimate concerns about algorithmic fairness. There is a growing demand for fairness, accountability, and transparency from machine learning (ML) systems. In this talk we cover how to build just such a pipeline leveraging open source tools.

A data scientist and entrepreneur focused on building intelligent systems to collect information and enable better decisions, Peter Skomoroch is currently cofounder and CEO of SkipFlag (recently acquired by Workday). Pete specializes in solving hard algorithmic problems, leading cross-functional teams, and developing engaging products powered by data and machine learning. Previously, he applied his skills to the consumer internet space at LinkedIn, the world’s largest professional network, where he was an early member of the data science team. As principal data scientist, he led data science teams focused on reputation, search, inferred identity, and building data products. He was also the creator of LinkedIn Skills and LinkedIn Endorsements.

Presentations

Executive Briefing: Why Managing Machines is Harder Than You Think Session

Companies that understand how to apply machine intelligence will scale and win their respective markets over the next decade. Others will fail to ship successful AI products that matter to customers. This talk describes how to combine product design, machine learning, and executive strategy to create a business where every product interaction benefits from your investment in machine intelligence.

Guoqiong Song is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

LSTM-Based Time Series Anomaly Detection Using Analytics Zoo for Spark and BigDL Session

Collecting and processing massive time series data (e.g., logs, sensor readings) and detecting anomalies in real time is critical for many emerging smart systems, such as industrial, manufacturing, AIOps, and IoT systems. This talk will share how to detect anomalies in time series data at scale using Analytics Zoo and BigDL on a standard Spark cluster.
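The core idea behind model-based anomaly detection on time series can be illustrated without any deep learning framework: predict each reading from recent history, then flag readings whose prediction error is unusually large. In the sketch below (illustrative only; a moving-average predictor stands in for the talk's LSTM, and the data is synthetic):

```python
# Prediction-error anomaly detection: predict each reading from a sliding
# window of history, then flag readings whose error exceeds mean + sigmas * std
# of all errors. A moving average stands in for a learned sequence model.
import statistics

def detect_anomalies(series, window=3, sigmas=2.0):
    errors = []
    for i in range(window, len(series)):
        predicted = sum(series[i - window:i]) / window
        errors.append(abs(series[i] - predicted))
    threshold = statistics.mean(errors) + sigmas * statistics.pstdev(errors)
    return [i + window for i, e in enumerate(errors) if e > threshold]

# Steady sensor readings with one injected spike at index 8.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 10.0, 25.0, 10.1, 9.9]
print(detect_anomalies(readings))  # -> [8], the injected spike
```

An LSTM replaces the moving average when the signal has seasonality or long-range structure that a fixed window cannot capture.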

Raghotham Sripadraj is Principal Data Scientist at Treebo Hotels. Previously, he was cofounder and data scientist at Unnati Data Labs, where he was building end-to-end data science systems in the fields of fintech, marketing analytics, and event management. Raghotham is also a mentor for data science on Springboard. Previously, at Touchpoints Inc., he single-handedly built a data analytics platform for a fitness wearable company; at SAP Labs, he was a core part of what is currently SAP’s framework for building web and mobile products, as well as a part of multiple company-wide events helping to spread knowledge both internally and to customers.

Drawing on his deep love for data science and neural networks and his passion for teaching, Raghotham has conducted workshops across the world and given talks at a number of data science conferences. Apart from getting his hands dirty with data, he loves traveling, Pink Floyd, and masala dosas.

Presentations

Deep Learning for Fonts Session

Deep learning has enabled massive breakthroughs in offbeat tracks and a better understanding of how an artist paints, how an artist composes music, and so on. As part of Nischal and Raghotham’s beloved project, Deep Learning for Humans, they want to build a font classifier and show the masses how fonts can be classified, and how and why two or more fonts are similar.

Scott Stevenson is a data scientist at ASI Data Science, a London artificial intelligence startup providing bespoke data science consultancy services, where he builds scalable data science and machine learning tools and infrastructure. Scott holds a DPhil in Particle Physics from the University of Oxford, and before joining ASI carried out research at CERN and Stanford University, and developed computational models for the UK’s National Nuclear Laboratory.

Presentations

Deep learning for speech synthesis: the good news, the bad news, and the fake news Session

Modern deep learning systems allow us to build speech synthesis systems with the naturalness of a human speaker. Whilst there are myriad benevolent applications, this also ushers in a new era of fake news. This talk will explore the danger of such systems, as well as how deep learning can also be used to build countermeasures to protect against political disinformation.

Ravi is Lead Data Engineer at GoJek. He is building resilient and scalable data infrastructure across all of GO-JEK’s 18+ products that help millions of Indonesians commute, shop, eat and pay, daily.

Presentations

Data Infrastructure at GoJek Session

At GO-JEK, we build products that help millions of Indonesians commute, shop, eat and pay, daily. The Data team is responsible for creating resilient and scalable data infrastructure across all of GO-JEK’s 18+ products. This involves building distributed big data infrastructure and real-time analytics and visualization pipelines for billions of data points per day.

Václav Surovec is a Senior Big Data engineer at T-Mobile CZ, where he has worked since 2014. He co-manages the Big Data department of more than 45 people and co-leads several projects focused on Hadoop and Big Data. Born in 1988, he lives in Prague.

Presentations

Data Science in Deutsche Telekom - Predicting global travel patterns and network demand Session

The knowledge of the location and travel patterns of customers is important for many companies; one of them is the German telco service operator Deutsche Telekom. The Commercial Roaming project, built on Cloudera Hadoop, helped the company analyze the behavior of its customers from 13 countries in a very secure way, in order to provide better predictions and visualizations for senior management.

Anna is an engineering manager at Cloudera, where she established and manages the Data Interoperability team. As a software engineer at Cloudera, she worked on Apache Sqoop. Anna cares about enabling people to build high-quality software in a sustainable environment. Before her time at Cloudera, she worked on risk management systems at Morgan Stanley.

Presentations

Picking Parquet: Improved Performance for Selective Queries in Impala, Hive, and Spark Session

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. We will cover the technical details of the design and its implementation, and we will give practical tips to help data architects leverage these new capabilities in their schema design. Finally, we will show performance results for common workloads.
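The mechanism behind column indexes can be pictured as per-page min/max statistics that let a reader skip pages whose value range cannot satisfy a predicate. A toy model of that pruning decision (illustrative Python, not Parquet internals):

```python
# Toy model of page pruning with per-page min/max statistics, the idea behind
# Parquet column indexes: for a predicate `col = value`, only pages whose
# [min, max] range could contain the value need to be read at all.
def prune_pages(pages, value):
    """Return indexes of pages that might contain `value`."""
    return [i for i, (lo, hi) in enumerate(pages) if lo <= value <= hi]

# Min/max pairs for four pages of a sorted column.
pages = [(1, 250), (251, 500), (501, 750), (751, 1000)]
print(prune_pages(pages, 600))  # -> [2]: three of four pages are skipped
```

The benefit is largest on sorted or clustered columns, where page ranges rarely overlap, which is why schema design matters for selective queries.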

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial for scalable NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Spark NLP in Action: How Indeed Applies NLP to Standardize Resume Content at Scale Session

In this talk you will learn how to use Spark NLP and Apache Spark to standardize semi-structured text. You will see how Indeed standardizes resume content at scale.

Presentations

Keynote with Michael Tidmarsh Keynote

Michael Tidmarsh

Amy Unruh is a developer programs engineer for the Google Cloud Platform, where she focuses on machine learning and data analytics as well as other Cloud Platform technologies. Amy has an academic background in CS/AI and has also worked at several startups, done industrial R&D, and published a book on App Engine.

Presentations

Serverless Machine Learning with TensorFlow, Part I Tutorial

This tutorial provides an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and develop skills in developing, evaluating, and productionizing ML models.

Serverless Machine Learning with TensorFlow, Part II Tutorial

This tutorial provides an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and develop skills in developing, evaluating, and productionizing ML models.

Sandeep Uttamchandani is the hands-on Chief Data Architect at Intuit, where he is currently leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Prior to Intuit, Sandeep played various engineering roles at VMware and IBM, as well as founding a startup focused on ML for managing enterprise systems. Sandeep’s experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s federal and Fortune 100 customers. Sandeep has received several excellence awards and holds over 40 issued patents, with 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions, gives guest lectures for university courses, and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and is a past associate editor for ACM Transactions on Storage. He blogs on LinkedIn and Wrong Data Fabric (his personal blog). Sandeep holds a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign.

Presentations

Half correct and Half wrong tribal data knowledge: Our 3 patterns to sanity! Session

Teams today rely on tribal data dictionaries that are a mixed bag with respect to correctness: some datasets have accurate attribute details, while others are incorrect and outdated. This significantly impacts the productivity of analysts and scientists. Existing data dictionary tools are manually updated and difficult to maintain. This talk covers three patterns we have deployed to manage data dictionaries.

Vinithra Varadharajan is a senior engineering manager in the cloud organization at Cloudera, where she is responsible for the cloud portfolio products, including Altus Data Engineering, Altus Analytic Database, Altus SDX, and Cloudera Director. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges, from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover cloud architecture and its challenges in depth; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Nanda Vijaydev is director of solution management at BlueData, where she leverages technologies like Hadoop, Spark, Python, and TensorFlow to build solutions for enterprise analytics and machine learning use cases. Nanda has 10 years of experience in data management and data science. Previously, she worked on data science and big data projects in multiple industries, including healthcare and media; was a principal solutions architect at Silicon Valley Data Science; and served as director of solutions engineering at Karmasphere. Nanda has an in-depth understanding of the data analytics and data management space, particularly in the areas of data integration, ETL, warehousing, reporting, and machine learning.

Presentations

Deep learning with TensorFlow and Spark using GPUs and Docker containers Session

Organizations need to keep ahead of their competition by using the latest AI/ML/DL technologies, such as Spark, TensorFlow, and H2O. The challenge is how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. This session will discuss the effective deployment of such applications in a container environment.

Lars is a software engineer at Cloudera. He has worked on various parts of Apache Impala, including crash handling, its Parquet scanners, and scan range scheduling. Most recently, he worked on integrating Kudu’s RPC framework into Impala. Before his time at Cloudera, he worked on various databases at SAP.

Presentations

Picking Parquet: Improved Performance for Selective Queries in Impala, Hive, and Spark Session

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. We will cover the technical details of the design and its implementation, and we will give practical tips to help data architects leverage these new capabilities in their schema design. Finally, we will show performance results for common workloads.
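The column indexes described above work by recording per-page min/max statistics, so a query engine can skip pages that cannot possibly match a selective predicate. The following pure-Python sketch illustrates only the pruning logic; the page structure and names here are illustrative, not Parquet's actual on-disk format.

```python
# Illustrative sketch of min/max-based page pruning, the idea behind
# Parquet column indexes. The page layout here is hypothetical.

def prune_pages(pages, lo, hi):
    """Keep only pages whose [min, max] range can contain a value in [lo, hi]."""
    return [p for p in pages if p["max"] >= lo and p["min"] <= hi]

pages = [
    {"id": 0, "min": 1,   "max": 100},
    {"id": 1, "min": 101, "max": 200},
    {"id": 2, "min": 201, "max": 300},
]

# A selective predicate such as `WHERE col BETWEEN 150 AND 180`
# only needs to read page 1; the other pages are skipped entirely.
survivors = prune_pages(pages, 150, 180)
```

The more tightly a schema sorts or clusters data on the filtered column, the narrower each page's min/max range becomes and the more pages a selective query can skip.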

Scaling Impala - Common Mistakes and Best Practices Session

Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala can handle hundreds of nodes and tens of thousands of queries hourly. In this talk, we will discuss how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and antipatterns for end users and BI applications.

Kai Waehner is a technology evangelist at Confluent. Kai’s areas of expertise include big data analytics, machine learning, deep learning, messaging, integration, microservices, the internet of things, stream processing, and the blockchain. He is a regular speaker at international conferences such as JavaOne, O’Reilly Software Architecture, and ApacheCon and has written a number of articles for professional journals. Kai also shares his experiences with new technologies on his blog.

Presentations

Unleashing Apache Kafka and TensorFlow in Hybrid Architectures Session

How can you leverage the flexibility and extreme scale of the public cloud, combined with the Apache Kafka ecosystem, to build scalable, mission-critical machine learning infrastructures that span multiple public clouds or bridge your on-premises data centre to the cloud? Join this talk to learn how to apply technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler, Ph.D. is the VP of Fast Data Engineering at Lightbend, where he leads the development of the Lightbend Fast Data Platform, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Executive Briefing: What it takes to use machine learning in fast data pipelines Session

Your team is building machine learning capabilities. I'll discuss how you can integrate these capabilities into streaming data pipelines so you can leverage the results quickly and update them as needed. There are big challenges: How do you build long-running services that are highly reliable and scalable? How do you combine a spectrum of very different tools, from data science to operations?

Hands-on Machine Learning with Kafka-based Streaming Pipelines Tutorial

This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and Spark ML, performance considerations, model metadata tracking, and other techniques.
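The core pattern in the abstract above, low-latency scoring in a live stream with periodic model updates, can be sketched with an in-memory queue standing in for the Kafka data backplane. All names here are illustrative; a real pipeline would use a Kafka consumer and a genuine retraining trigger.

```python
from collections import deque

# Stand-in for a Kafka topic: a deque of event records.
events = deque([{"x": 1.0}, {"x": 2.0}, {"x": 3.0}])

# The "model" is just a callable; in production it might be a
# TensorFlow or Spark ML model reloaded when retraining publishes
# a new version.
model = lambda rec: rec["x"] * 2.0

def maybe_reload(step):
    """Hypothetical retraining hook: periodically swap in a fresh model."""
    global model
    if step % 2 == 0:
        # In this sketch the "new" model has the same weights.
        model = lambda rec: rec["x"] * 2.0

scores = []
step = 0
while events:
    rec = events.popleft()       # analogous to consumer.poll()
    scores.append(model(rec))    # low-latency scoring in the stream
    step += 1
    maybe_reload(step)
```

The key design point is that scoring never blocks on retraining: the serving loop keeps consuming while a separate process trains and then atomically swaps the model reference.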

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges from re-architecting to be cloud-native, to data context consistency across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover in depth cloud architecture and challenges; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Moshe Wasserblat is the natural language processing and deep learning research group manager for Intel’s Artificial Intelligence Products Group. Previously, he spent more than 17 years at NICE Systems, where he founded and led the speech/text analytics research team. His interests lie in speech processing and natural language processing. He was a co-founder and coordinator of the EXCITEMENT FP7 ICT program and has served as an organizer and manager of several initiatives, including many Israeli Chief Scientist programs. He has filed more than 60 patents in the field of language technology and has several publications in international conferences and journals. His areas of expertise include speech recognition, conversational natural language processing, emotion detection, speaker separation, speaker recognition, deep learning, and machine learning.

Presentations

NLP Architect by Intel's AI-Lab Session

Moshe Wasserblat presents an overview of NLP Architect, an open source deep learning NLP library that provides state-of-the-art NLP models, making it easy for researchers to implement NLP algorithms and for data scientists to build NLP-based solutions for extracting insights from textual data to improve business operations.

Dr Sophie Watson is a software engineer in an Emerging Technology Group at Red Hat, where she applies her data science and statistics skills to solving business problems and informing next-generation infrastructure for intelligent application development. She has a background in mathematics and holds a PhD in Bayesian statistics, in which she developed algorithms to estimate intractable quantities quickly and accurately.

Presentations

Learning "Learning to Rank" Session

Identifying relevant documents quickly and efficiently enhances both user experience and business revenue every day. Sophie Watson demonstrates how to implement Learning to Rank algorithms and provides you with the information you need to implement your own successful ranking system.
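Ranking systems like the one described above are commonly evaluated with metrics such as normalized discounted cumulative gain (NDCG), which rewards placing highly relevant documents near the top. A minimal sketch, not tied to any particular library:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal (best possible) ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranking that puts the most relevant document first scores 1.0;
# worse orderings score strictly lower.
perfect = ndcg([3, 2, 1])
swapped = ndcg([1, 2, 3])
```

Learning-to-rank algorithms optimize (directly or via surrogates) exactly this kind of position-discounted metric, which is why they outperform models trained on pointwise relevance alone.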

Thomas is a software engineer on the streaming platform team at Lyft, working with Apache Flink. He is also a PMC member of Apache Apex and Apache Beam and has contributed to several other ecosystem projects. Thomas is a frequent speaker at international big data conferences and the author of the book Learning Apache Apex.

Presentations

Streaming at Lyft Session

Fast data and stream processing are essential for making Lyft rides a good experience for passengers and drivers. Our systems need to track and react to event streams in real time to update locations, compute routes and estimates, balance prices, and more. The streaming platform at Lyft powers these use cases with development frameworks and a deployment stack based on Apache Flink and Beam.

Charlotte Werger works at the intersection of artificial intelligence and finance. After completing her PhD at the European University Institute in Florence, she worked in quantitative hedge funds at BlackRock and Man AHL in London as a portfolio manager and quant researcher. There she was part of an early movement in asset management that initiated the application of machine learning models to predicting financial markets. Having developed a broader interest in AI and machine learning, she then worked for ASI Data Science, a cutting-edge AI startup that builds AI applications and software for its clients. Currently, Charlotte is lead data scientist at Van Lanschot Kempen, a wealth manager and private bank in Amsterdam, where she is challenged to transform this traditional company into a cutting-edge, data-driven one. Outside of work, she is internationally active in data science and AI education and advisory: she is an instructor for DataCamp, mentors data science students on the Springboard platform, and has an advisory role at Ryelore AI.

Presentations

Fraud Detection at a Financial Institution using Unsupervised Learning & Text mining Session

This talk discusses a best-practice use case for detecting fraud at a financial institution. Where traditional systems fall short, machine learning models can provide a solution. By sifting through large amounts of transaction data, external hit lists, and unstructured text data, we managed to build a dynamic and robust monitoring system that successfully detects unwanted client behavior.

Elliot is a principal engineer at Hotels.com in London, where he designs tooling and platforms in the big data space. Prior to this, Elliot worked on Last.fm’s data team, developing services for managing large volumes of music metadata.

Presentations

Herding Elephants: Seamless data access in multi-cluster clouds Session

Expedia Group is a travel platform with an extensive portfolio including Expedia.com and Hotels.com. We like to give our data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. We'll explain how we built a unified virtual data lake on top of our many heterogeneous and distributed data platforms.

Mutant Tests Too: The SQL Session

Hotels.com describe approaches for applying software engineering best practices to SQL-based data applications in order to improve maintainability and data quality. Using open source tools, we show how to build effective test suites for Apache Hive code bases. We also present Mutant Swarm, a mutation testing tool we’ve developed to identify weaknesses in tests and to measure SQL code coverage.
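Mutation testing, as used in the abstract above, works by introducing small deliberate faults into the code under test and checking whether the test suite catches ("kills") them. The following toy sketch applies the idea to SQL text; Mutant Swarm's actual mechanics differ, and every name here is illustrative.

```python
def mutants(sql):
    """Generate simple mutants by swapping one operator at a time."""
    swaps = [(" = ", " <> "), (" > ", " >= "), (" AND ", " OR ")]
    for old, new in swaps:
        if old in sql:
            yield sql.replace(old, new, 1)

def survivors(sql, test_passes):
    """Mutants that still pass the test suite indicate weak tests."""
    return [m for m in mutants(sql) if test_passes(m)]

query = "SELECT * FROM bookings WHERE price > 100 AND status = 'OK'"

# A weak "test" that only checks the query still mentions the table:
weak_test = lambda q: "bookings" in q
alive = survivors(query, weak_test)  # every mutant survives this test
```

A strong test suite would assert on actual query results over fixture data, so that flipping `>` to `>=` or `AND` to `OR` changes the output and the mutant is killed.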

Dr. Arif Wider is a lead consultant and developer at ThoughtWorks Germany, where he enjoys building scalable applications, teaches Scala, and consults at the intersection of data science and software engineering. Before joining ThoughtWorks, he was in research with a focus on data synchronisation, bidirectional transformations, and domain-specific languages.

Presentations

Continuous Intelligence: Keeping your AI Application in Production Session

Machine learning can be challenging to deploy and maintain. Data changes, and both models and the systems that implement them must be able to adapt. Any delay moving models from research to production means leaving your data scientists' best work on the table. In this talk, we explore continuous delivery (CD) for AI/ML and examine case studies applying CD principles to data science workflows.

Alicia is an advocate for Google Cloud. Previously, she spent six years as a program manager, and through building, managing, and measuring programs and processes, she fell in love with data science. Known to hang out in spreadsheets surrounded by formulas, she also uses machine learning, SQL, and visualizations to help solve problems and tell stories.

Presentations

Building custom machine learning models for production, without ML expertise DCS

In this talk, Alicia Williams will share how two media companies built custom machine learning models to organize content and make it accessible around the world. Along the way, we will discuss the business problems they solved with ML, demonstrate the ease of use of the tools themselves, and show the value that ML has brought in each case.

Christoph Windheuser studied computer science in Bonn (Germany), Pittsburgh (USA), and Paris (France) and earned his PhD in speech recognition with artificial neural networks. After his scientific career, he worked in various positions in the IT industry (including SAP and Capgemini Consulting). Today, Christoph is the global head of intelligent empowerment at ThoughtWorks, responsible for ThoughtWorks' positioning on data management, machine learning, and artificial intelligence.

Presentations

Continuous Intelligence: Moving Machine Learning into Production Reliably Tutorial

In this workshop, we will present how to apply the concept of continuous delivery (CD), which ThoughtWorks pioneered, to data science and machine learning. CD allows data scientists to make changes to their models while safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with high frequency.

Mingxi Wu is the vice president of engineering at TigerGraph, a Silicon Valley-based startup building a world-leading real-time graph database. Over his career, Mingxi has focused on database research and data management software. Previously, he worked in Microsoft’s SQL Server group, Oracle’s Relational Database Optimizer group, and Turn Inc.’s Big Data Management group. Lately, his interest has turned to building an easy-to-use and highly expressive graph query language. He has won research awards from the most prestigious publication venues in database and data mining, including SIGMOD, KDD, and VLDB and has authored five US patents with three more international patents pending. Mingxi holds a PhD from the University of Florida, specializing in both database and data mining.

Presentations

Eight Prerequisites of a Graph Query Language Session

A graph query language is the key to unlocking the value of connected data. In this talk, we present eight prerequisites of a practical graph query language, drawn from our six years of experience with real-world graph analytics use cases, and we compare GSQL, Gremlin, Cypher, and SPARQL in this regard.
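To see what such languages buy you, consider a two-hop "friends of friends" traversal, a one-line pattern in Cypher or GSQL, written out by hand below. The graph data is made up for the example.

```python
# Tiny adjacency-list graph; a graph query language lets you express
# this traversal declaratively instead of coding the loops yourself.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": [],
    "eve": [],
}

def two_hop(g, start):
    """Friends of friends: neighbors of neighbors, excluding the start
    node and its direct friends."""
    direct = set(g[start])
    result = set()
    for friend in direct:
        for fof in g[friend]:
            if fof != start and fof not in direct:
                result.add(fof)
    return result

fofs = two_hop(graph, "ann")  # the set containing "dan" and "eve"
```

In Cypher the same traversal is roughly `MATCH (a {name:'ann'})-->()-->(fof) RETURN DISTINCT fof` (modulo the exclusion filters); a good query language also lets the engine optimize and parallelize the traversal, which hand-written loops cannot.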

Tony Wu manages the Altus core engineering team at Cloudera. Previously, Tony was a team lead for the partner engineering team at Cloudera. He is responsible for Microsoft Azure integration for Cloudera Director.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges from re-architecting to be cloud-native, to data context consistency across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover in depth cloud architecture and challenges; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Chendi Xue is a software engineer on Intel's SSG data analytics team. She has five years of experience in Linux cloud storage system development, optimization, and benchmarking, including Ceph benchmarking and tuning, performance tuning and optimization of Spark on disaggregated storage, and HDCS development.

Presentations

Big data analytics on the public cloud: Challenges and opportunities Session

We introduce the challenges of migrating big data analytics workloads to the public cloud, such as lost performance and missing features, and we showcase how a new in-memory data accelerator leveraging persistent memory and RDMA NICs can resolve these issues and enable new opportunities for big data workloads on the cloud.

I am a big data tech lead at the Nielsen Marketing Cloud. I have been dealing with big data challenges for the past six years, using tools like Spark, Druid, and Kafka. I’m keen on sharing my knowledge and have presented my real-life experience in various forums (e.g., meetups and conferences).

Presentations

Stream, Stream, Stream: Different Streaming methods with Spark and Kafka Session

At Nielsen Marketing Cloud, we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores and we need to do it in a scalable yet cost-efficient manner. In this talk, we will discuss how we continuously transform our data infrastructure to support these goals.

Alexis Yelton is a data scientist at Indeed. She has a Ph.D. in bioinformatics and did postdoctoral work building models to predict gene function and explain ecosystem function. Since then she has been focusing on building machine learning models for software products. She has been working with Spark since version 1.6 and has recently moved into the NLP space.

Presentations

Spark NLP in Action: How Indeed Applies NLP to Standardize Resume Content at Scale Session

In this talk, you will learn how to use Spark NLP and Apache Spark to standardize semi-structured text, and you will see how Indeed standardizes resume content at scale.

Jian Zhang is a software engineering manager at Intel, where he and his team focus primarily on open source storage development and optimization on Intel platforms and build reference solutions for customers. He has 10 years of experience in performance analysis and optimization for many open source projects, such as Xen, KVM, Swift, Ceph, and HDFS, and for benchmark workloads like SPEC and TPC. Jian has a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Presentations

Big data analytics on the public cloud: Challenges and opportunities Session

We introduce the challenges of migrating big data analytics workloads to the public cloud, such as lost performance and missing features, and we showcase how a new in-memory data accelerator leveraging persistent memory and RDMA NICs can resolve these issues and enable new opportunities for big data workloads on the cloud.

Weifeng Zhong is a research fellow in economic policy studies at the American Enterprise Institute, where his research focuses on Chinese economic issues and political economy. His recent work has been on the application of text-analytic and machine-learning techniques to political economy issues such as the US presidential election, income inequality, and predicting policy changes in China. He has been published in a variety of scholarly journals, including the Journal of Institutional and Theoretical Economics. In the popular press, his writings have appeared in the Financial Times, Foreign Affairs, The National Interest, and Real Clear Politics, among others. He has a Ph.D. and an M.Sc. in managerial economics and strategy from Northwestern University. He also holds M.Econ. and M.Phil. degrees in economics from the University of Hong Kong and a B.A. in business administration from Shantou University in China.

Presentations

Reading China: Predicting Policy Change with Machine Learning Session

We developed a machine learning algorithm to “read” the People’s Daily — the official newspaper of the Communist Party of China — and predict changes in China’s policy priorities using only the information in the newspaper. The output of this algorithm, which we call the Policy Change Index (PCI) of China, turns out to be a leading indicator of the actual policy changes in China since 1951.

Yuan Zhou is a senior software development engineer in the Software and Services Group at Intel, working on the Open Source Technology Center team with a primary focus on big data storage software. He has worked on databases, virtualization, and cloud computing for most of his 7+ years at Intel.

Presentations

Big data analytics on the public cloud: Challenges and opportunities Session

We introduce the challenges of migrating big data analytics workloads to the public cloud, such as lost performance and missing features, and we showcase how a new in-memory data accelerator leveraging persistent memory and RDMA NICs can resolve these issues and enable new opportunities for big data workloads on the cloud.

Xiaoyong Zhu is a senior data scientist at Microsoft, where he focuses on distributed machine learning and its applications.

Presentations

Inclusive Design: Deep Learning on audio in Azure, identifying sounds in real time Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in the world who are deaf or hard of hearing. We will explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.