Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Speakers

Hear from innovative CxOs, talented data practitioners, and senior engineers who are leading the data industry. More speakers will be announced; please check back for updates.

Filter

Search Speakers

Peter Aiken is an acknowledged Data Management (DM) authority. As a practicing data consultant, professor, author, and researcher, he has studied DM for more than 30 years. International recognition has come from assisting more than 150 organizations in 30 countries, including some of the world’s most important. He is a dynamic presence at events and the author of 10 books and multiple publications, including his latest on data strategy. Peter also hosts the longest-running and most successful webinar series dedicated to DM (hosted by dataversity.net). In 1999, he founded Data Blueprint Inc., a consulting firm that helps organizations leverage data for profit, improvement, competitive advantage, and operational efficiencies. He is also Associate Professor of Information Systems at Virginia Commonwealth University (VCU), past President of the International Data Management Association (DAMA-I), and Associate Director of the MIT International Society of Chief Data Officers.

Presentations

Your Data Strategy: It Should Be Concise, Actionable, and Understandable by Business and IT! Tutorial

This tutorial presents a more operational perspective on the use of data strategy that is especially useful for organizations just getting started with data.

Alasdair Allan is a scientist and researcher who has authored over eighty peer-reviewed papers and eight books and has been involved with several standards bodies. Originally an astrophysicist, he now works as a consultant and journalist, focusing on open hardware, machine learning, big data, and emerging technologies, with expertise in electronics (especially wireless devices and distributed sensor networks), mobile computing, and the Internet of Things. He runs a small consulting company and has written for Make: Magazine, Motherboard/VICE, Hackaday, Hackster.io, and the O’Reilly Radar. In the past he has mesh networked the Moscone Center, caused a U.S. Senate hearing, and contributed to the detection of what was, at the time, the most distant object yet discovered.

Presentations

Executive Briefing: The Intelligent Edge and the Demise of Big Data? Session

The arrival of a new generation of smart embedded hardware may cause the demise of large-scale data harvesting. In its place, smart devices will allow us to process data at the edge, extracting insights without storing potentially privacy- and GDPR-infringing data. The current age, in which privacy is no longer “a social norm,” may not long survive the coming of the Internet of Things.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Professional Kafka Development 2-Day Training

This training takes participants through an in-depth look at Apache Kafka. We show how Kafka works and how to create real-time systems with it, including how to create consumers and producers. We then look at Kafka’s ecosystem and how each component is used, showing how to use Kafka Streams, Kafka Connect, and KSQL.

Professional Kafka Development (Day 2) Training Day 2

This training takes participants through an in-depth look at Apache Kafka. We show how Kafka works and how to create real-time systems with it, including how to create consumers and producers. We then look at Kafka’s ecosystem and how each component is used, showing how to use Kafka Streams, Kafka Connect, and KSQL.

Eitan Anzenberg is the Chief Data Scientist at Flowcast AI, a seed-stage fintech startup in San Francisco. He leads the data science efforts, including machine learning explanations, interpretability, and what-if scenario analysis. His background is in machine learning, statistical learning, and programming. Eitan obtained his PhD in physics from Boston University in 2012 and completed his postdoc at Lawrence Berkeley National Lab in 2014.

Presentations

Explainable Machine Learning in Fintech Session

Machine learning applications balance interpretability and performance. Linear models provide formulas to directly compare the influence of the input variables, while non-linear algorithms produce more accurate models. We utilize "what-if" scenarios to calculate the marginal influence of features per prediction and compare with standardized methods such as LIME.
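The marginal-influence idea described above is simple enough to sketch in a few lines. The model, feature names, and baseline values below are hypothetical stand-ins (not the Flowcast system): each feature is reset to a baseline value in turn, and the resulting change in the prediction is taken as that feature's influence.

```python
import math

def marginal_influence(predict, x, baseline):
    """Per-feature influence: how much the prediction changes when
    feature i is reset to its baseline ("what if it were typical?")."""
    full = predict(x)
    influences = []
    for i in range(len(x)):
        x_what_if = list(x)
        x_what_if[i] = baseline[i]
        influences.append(full - predict(x_what_if))
    return influences

# Toy non-linear scoring model, standing in for a trained classifier.
def predict(x):
    income, debt_ratio, late_payments = x
    z = 0.00005 * income - 3 * debt_ratio - 0.5 * late_payments
    return 1 / (1 + math.exp(-z))

x = [60_000, 0.4, 2]          # the applicant being explained
baseline = [50_000, 0.3, 0]   # an "average" applicant
print(marginal_influence(predict, x, baseline))
```

Unlike LIME, which fits a local surrogate model around the prediction, this directly probes the model with counterfactual inputs, which is why the two make a useful cross-check.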

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

A Magic 8-Ball for Optimal Cost and Resource Allocation for the Big Data Stack Session

Cost and resource provisioning are critical components of the big data stack. A magic 8-ball for the big data stack would give an enterprise a glimpse into its future needs and would enable effective and cost-efficient project and operational planning. This talk covers how to build that magic 8-ball, a decomposable time-series model, for optimal cost and resource allocation for the big data stack.
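As a flavor of what a decomposable time-series model looks like, the toy sketch below (synthetic data, not the speaker's actual system) splits simulated daily cluster usage into a linear trend plus a weekly seasonal component, then extrapolates both to forecast future demand:

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(56)  # 8 weeks of daily cluster usage
usage = 100 + 0.5 * days + 10 * np.sin(2 * np.pi * days / 7) \
        + rng.normal(0, 1, 56)

# Trend component: least-squares line through the series.
slope, intercept = np.polyfit(days, usage, 1)
detrended = usage - (slope * days + intercept)

# Seasonal component: mean residual for each day of the week.
seasonal = np.array([detrended[days % 7 == d].mean() for d in range(7)])

# Forecast the next two weeks: extrapolated trend + seasonality.
future = np.arange(56, 70)
forecast = slope * future + intercept + seasonal[future % 7]
print(forecast.round(1))
```

Because each component is fit separately, the forecast stays interpretable: capacity planners can see how much of predicted demand is growth versus weekly rhythm.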

Jason Bell is a machine learning engineer specializing in high-volume streaming systems, big data solutions, and machine learning applications. Jason was section editor for Java Developer’s Journal, has contributed to IBM developerWorks on autonomic computing, and is the author of Machine Learning: Hands On for Developers and Technical Professionals.

Presentations

Learning how to perform ETL data migrations with the open source tool Embulk Session

The Embulk data migration tool offers a convenient way to load data into a variety of systems with basic configuration. This talk gives an overview of the Embulk tool and shows some common data migration scenarios that a data engineer could employ using it.
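To give a flavor of the "basic configuration" involved: an Embulk run is driven by a YAML file with an `in` and an `out` section. The sketch below is illustrative only (paths, columns, and connection details are made up, and the exact options depend on which input/output plugins are installed); it loads CSV files into PostgreSQL:

```yaml
in:
  type: file
  path_prefix: /data/exports/orders_   # matches orders_*.csv
  parser:
    type: csv
    skip_header_lines: 1
    columns:
      - {name: order_id, type: long}
      - {name: amount, type: double}
      - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out:
  type: postgresql                     # embulk-output-postgresql plugin
  host: localhost
  database: warehouse
  table: orders
  mode: insert
```

Swapping the target system is then largely a matter of replacing the `out` section with a different plugin's options.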

Francine Bennett is a data scientist and the CEO and cofounder of Mastodon C, a group of Agile big data specialists who offer the open source Hadoop-powered technology and the technical and analytical skills to help organizations to realize the potential of their data. Before founding Mastodon C, Francine spent a number of years working on big data analysis for search engines, helping them to turn lots of data into even more money. She enjoys good coffee, running, sleeping as much as possible, and exploring large datasets.

Presentations

Using data for evil V: the AI strikes back Session

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis (and possibly AI). Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

Daniel works in the Developer Relations team at Google. With more than fifteen years of experience in the software industry, Daniel has held positions at companies such as Ericsson and Opera Software. Daniel holds a Bachelor’s degree in Computer Science from Uppsala University. He lives in Stockholm and likes to spend his spare time freediving.

Presentations

Processing 10M samples/second to drive smart maintenance in complex IIoT systems Session

Learn how Cognite is developing IIoT smart maintenance systems that can process 10M samples/second from thousands of sensors. We’ll review an architecture designed for high-performance, robust streaming sensor data ingest and cost-effective storage of large volumes of time series data, along with best practices for aggregation and fast queries and for achieving high performance with machine learning.

I am currently working on query optimizations and resource utilization in Apache Spark at Qubole.

Presentations

Scalability-aware autoscaling of Spark applications Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. This talk covers (1) measuring the efficiency of autoscaling policies and (2) designing more efficient autoscaling policies in terms of latency and cost.
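One way to make "scalability-aware" concrete: fit a scaling model from historical runs and only add executors while the marginal latency gain still justifies the cost. The sketch below uses a hypothetical Amdahl-style runtime curve as a stand-in for a model learned from history:

```python
def runtime(executors, serial_frac=0.1, base=1000.0):
    """Modeled job runtime (s): a serial part that never scales,
    plus a parallelizable part that shrinks with more executors."""
    return base * (serial_frac + (1 - serial_frac) / executors)

def pick_executors(max_executors=64, min_gain=0.02):
    """Grow the allocation while each extra executor still cuts
    runtime by at least min_gain (2%); stop once scaling no longer
    pays for itself."""
    n = 1
    while n < max_executors:
        gain = (runtime(n) - runtime(n + 1)) / runtime(n)
        if gain < min_gain:
            break
        n += 1
    return n

print(pick_executors())
```

A naive policy that scales purely on pending tasks would keep adding executors past this point, paying for capacity that the serial fraction of the job can never use.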

Pradeep is a Big Data Engineer at Hotels.com in London, where he builds and manages cloud infrastructure and core services like Apiary. Pradeep has worked in the big data space for the last 7 years, building large-scale platforms.

Presentations

Herding Elephants: Seamless data access in multi-cluster clouds Session

Expedia Group is a travel platform with an extensive portfolio including Expedia.com and Hotels.com. We like to give our data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. We'll explain how we built a unified virtual data lake on top of our many heterogeneous and distributed data platforms.

I am a Program Manager working on Microsoft Azure Machine Learning – building an exciting Machine Learning service that will make it easy for all data scientists and ML engineers to create and deploy robust, scalable and highly available machine learning web services on the cloud.

Presentations

Time Series Forecasting with Azure Machine Learning service Tutorial

Time series modeling and forecasting is of fundamental importance to many practical domains, and during the past few decades machine learning model-based forecasting has become very popular in private and public decision making. In this tutorial, we will walk you through the core steps for using Azure Machine Learning to build and deploy your time series forecasting models.

Wojciech Biela is a co-founder of Starburst and is responsible for product development. He has a background of over 14 years of building products and running engineering teams.

Previously, Wojciech was the Engineering Manager at the Teradata Center for Hadoop, running the Presto engineering operations in Warsaw, Poland. Prior to that, back in 2011, he built and ran the Polish engineering team, a subsidiary of Hadapt Inc., a pioneer in the SQL-on-Hadoop space; Hadapt was acquired by Teradata in 2014. Earlier, Wojciech built and led teams on multi-year projects, from custom big e-commerce and SCM platforms to PoS systems.

Wojciech holds an M.S. in Computer Science from the Wroclaw University of Technology.

Presentations

Presto: Cost-Based Optimizer for interactive SQL-on-Anything Session

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3/Azure ADLS, RDBMS, NoSQL, etc.). Recently, Starburst contributed the Cost-Based Optimizer to Presto, which brings a great performance boost. Learn about the CBO’s internals, the motivating use cases, and the observed improvements.

Alun Biffin received his PhD in condensed matter physics at the University of Oxford before being awarded a Marie Curie Fellowship and going on to the Paul Scherrer Institute, Switzerland, to continue his research. He designed and conducted ground-breaking experiments on quantum magnets at cutting-edge facilities in Europe, the US, and Japan, and presented his work at international workshops and conferences. During this time he published three papers as first author and has been cited over 100 times. He was chosen for the highly selective ASI Data Science Fellowship, London, in the summer of 2018. Since then he has been applying his passion for machine learning to real-life business problems, ranging from analyzing millions of webhits for online retailer Tails.com, to predicting customer behavior at one of the Netherlands’ largest private banks, to his current employer, Van Lanschot Kempen.

Presentations

Using Machine Learning for Stock Picking Session

In this talk we describe how machine learning revolutionized the stock picking process for portfolio managers at Kempen Capital Management by filtering the vast small-cap investment universe down to a handful of optimal stocks.

Peter is a principal director at Accenture Belux specializing in data-driven architectures and solutions. With 15 years of experience, he works mainly with financial services clients, helping them adapt to the growing importance of data in today’s digital context. With a passion for innovation, Peter leads the assets and offerings around data-driven architectures, applying them at clients to increase their ability to automate more and more decisions and interactions with clients, prospects, suppliers, and employees. Peter is a strong advocate for the power of metadata and believes that if we change the way we deal with it, companies can drive automation to a new level, combining delivery automation with solution automation from the design phase and yielding many efficiency and effectiveness benefits.

Presentations

Leveraging metadata for automating delivery and operations of advanced data platforms Session

In this session we will explain how to use metadata to automate the delivery and operations of a data platform. By injecting automation into the delivery process, we shorten time-to-market while improving the quality of the initial user experience. Typical examples include data profiling and prototyping, test automation, continuous delivery and deployment, and automated code creation.

David is passionate about helping businesses to build analytics-driven decision making to help them make quicker, smarter and bolder decisions. He leads customer strategy and insights at Harrods, the biggest and most iconic department store in Europe. He has previously built global analytics and insight capabilities for a number of leading global entertainment businesses covering television (the BBC), book publishing (HarperCollins Publishers) and the music industry (EMI Music), helping to drive each organization’s decision making at all levels. He builds on experiences working to build analytics for global retailers as well as political campaigns in the US and UK, in philanthropy and in strategy consulting.

Presentations

Combining Creativity and Analytics Keynote

Companies that harness creativity and data in tandem have growth rates twice as high as companies that don’t. David will share lessons from his successes and failures in trying to do just that across presidential politics, with pop stars and now with power brands in the world of luxury goods. Find out how analysts can work differently to build these partnerships and unlock this growth!

Claudiu Branzan is the vice president of data science and engineering at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial for scalable NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Mikio Braun is principal engineer for search at Zalando, one of Europe’s biggest fashion platforms. He worked in research for a number of years before becoming interested in putting research results to good use in the industry. Mikio holds a PhD in machine learning.

Presentations

Fair, privacy-preserving and secure ML

In this talk, we will look at techniques and concepts around fairness, privacy, and security when it comes to machine learning models.

Nicolette is the Head of Data Engineering at Santander UK Technology. She is a technical manager with 18 years’ experience in the IT services industry and has previously led large-scale, multi-location change projects comprising data provision, managed MI and data warehouses, ETL, system integration, and IT alignment.

Presentations

Designing the foundation for a data-driven future in financial services Findata

Attend this session to learn more about the way Santander have restructured their business around data. Learn about the people, processes and technology they brought together to make it a success and get practical ideas to help you start or progress your journey with big data.

James Burke has been called “One of the most intriguing minds in the Western world” (Washington Post). His audience is global. His influence in the field of the public understanding of science and technology is acknowledged in citations by such authoritative sources as the Smithsonian and Microsoft CEO Bill Gates. His work is on the curriculum of universities and schools across the United States.

In 1965 James Burke began work with BBC-TV on Tomorrow’s World and went on to become the BBC’s chief reporter on the Apollo Moon missions. For over forty years he has produced, directed, written and presented award-winning television series on the BBC, PBS, Discovery Channel and The Learning Channel. These include historical series, such as Connections (aired in 1979, it achieved the highest-ever documentary audience); The Day the Universe Changed; Connections2 and Connections3; a one-man science series, The Burke Special; a mini-series on the brain, The Neuron Suite; a series on the greenhouse effect, After the Warming; and a special for the National Art Gallery on Renaissance painting, Masters of Illusion.

A bestselling author, his publications include: Tomorrow’s World, Tomorrow’s World II, Connections, The Day the Universe Changed, Chances, The Axemaker’s Gift (with Robert Ornstein), The Pinball Effect, The Knowledge Web, Circles, and American Connections. He has also written a series of introductions for the book Inventing Modern America (MIT, 2002) and was a contributing author to Talking Back to the Machine (Copernicus, 1999) and Leading for Innovation (Drucker Foundation, 2002).

His book, Twin Tracks: The Unexpected Origins of the Modern World, focuses on the surprising connections among the seemingly unconnected people, events and discoveries that have shaped our world. Burke has also written and hosted a bestselling CD-ROM, Connections: A Mind Game, and provided consulting and scripting for Disney Epcot.

Burke is a frequent keynote speaker on the subject of technology and social change to audiences such as NASA, MIT, IBM, Microsoft, US Government Agencies and the World Affairs Council. He has also advised the National Academy of Engineering, The Lucas Educational Foundation and the SETI project.

He was a regular columnist for six years at Scientific American, and, most recently, contributed an essay on invention to the Britannica Online Encyclopedia. Burke is currently a contributor to TIME magazine. His most recent television work is a PBS retrospective of his work, ReConnections.

Educated at Oxford and holding honorary doctorates for his work in communicating science and technology, Burke is now working on an online interactive knowledge-mapping system (the ‘knowledge web’: www.k-web.org) to be used as a teaching aid, a management tool and a predictor. It is due to be online in 2020.

His next book, The Culture of Scarcity, will be published in 2020.

Presentations

Making the Future Keynote

Technology changes so fast these days, we spend much of our time just keeping up. Prediction, difficult enough at any time, is made even more complex when Big Data and predictive analytics immensely increase the number of options we need to consider.

Julia is an AI evangelist for Scout24 and is actively driving the culture change within Scout24. Julia has a strong background in product development, including data products, strategy, and innovation. She is an initiator of forward thinking who energizes through her creativity and enthusiasm.

Presentations

From data to data-driven to an AI-ready company - the culture change makes the difference DCS

Creating value out of your data is not about technology or engineers. It is all about changing the culture in the company so that everyone is aware of data and how to build on top of it. At Scout24 we are running a successful culture change and already have 60% of employees using our central BI tool. Since 2018 it has been all about AI enablement.

Dr Paris Buttfield-Addison is co-founder of Secret Lab, a game development studio based in beautiful Hobart, Australia. Secret Lab builds games and game development tools, including the multi-award-winning ABC Play School iPad games, the BAFTA- and IGF-winning Night in the Woods, the Qantas airlines Joey Playbox games, and the open source Yarn Spinner narrative game framework. Previously, Paris was mobile product manager for Meebo (acquired by Google). Paris particularly enjoys game design, statistics, the blockchain, machine learning, and human-centered technology research and writes technical books on mobile and game development (more than 20 so far) for O’Reilly Media. He holds a degree in medieval history and a PhD in computing. Find him online at http://paris.id.au and @parisba

Presentations

Science-Fictional User Interfaces Session

Science-fiction has been showcasing complex, AI-driven (often AR or VR) interfaces (for huge amounts of data!) for decades. As television, movies, and video games became more capable of visualising a possible future, the grandeur of these imagined science fictional interfaces has increased. What can we learn from Hollywood UX? Is there a useful takeaway? Does sci-fi show the future of AI UX?

Tatiane Canero is Patient Flow manager at Hospital Israelita Albert Einstein in São Paulo, Brazil, and has been in charge for the last 8 years of orchestrating all clinical and support service areas to care for over 85,000 patients yearly.
She has been responsible for implementing several process and clinical improvement initiatives aimed at releasing hospital capacity and maximizing patient safety, experience, and level of care. These initiatives have released an additional 60 virtual beds of capacity yearly.
A digital tech enthusiast, she has been engaged in how AI can disrupt patient flow. The first relevant outcome was the IRIS platform, which she intends to extend to public hospitals managed by the hospital.

Presentations

Insightful Health - Amplifying Intelligence in Healthcare Patient Flow Execution Session

How Albert Einstein and Accenture evolved patient flow experience and efficiency with the use of applied AI, statistics, and combinatorial math, giving the hospital E2E visibility into patient flow operations, from admission of emergency and elective demands to assignment and medical releases.

Data science expert and software system architect with expertise in machine learning and big data systems. Rich experience leading innovation projects and R&D activities to promote data science best practice within large organizations. Deep domain knowledge of various vertical use cases (finance, telco, healthcare, etc.). Currently working on pushing cutting-edge applications of AI at the intersection of high-performance databases and IoT, focusing on unleashing the value of spatial-temporal data. I am also a frequent speaker at various technology conferences, including the O’Reilly Strata and AI Conferences, NVIDIA GPU Technology Conference, Hadoop Summit, DataWorks Summit, Amazon re:Invent, Global Big Data Conference, Global AI Conference, World IoT Expo, and Intel Partner Summit, presenting keynote talks and sharing technology leadership thoughts.

I received my Ph.D. from the Department of Computer and Information Science (CIS) at the University of Pennsylvania, under the supervision of Professor Insup Lee (ACM Fellow, IEEE Fellow). I have published and presented research papers and posters at many top-tier conferences and journals, including ACM Computing Surveys, ACSAC, CEAS, EuroSec, FGCS, HiCoNS, HSCC, IEEE Systems Journal, MASHUPS, PST, SSS, TRUST, and WiVeC, and have served as a reviewer for many highly reputable international journals and conferences.

Presentations

Building The Data Infrastructure For The Internet Of Things At Zettabyte-Scale Session

We share the architecture design and many of the detailed technology innovations of Alibaba TSDB, a state-of-the-art database for IoT data management, drawn from years of development and continuous improvement.

Jean-Luc Chatelain is a managing director for Accenture Digital and the CTO for Accenture Applied Intelligence, where he focuses on helping Accenture customers become information-powered enterprises by architecting state-of-the-art big data solutions. Previously, Jean-Luc was the executive vice president of strategy and technology for DataDirect Networks Inc. (DDN), the world’s largest privately held big data storage company, where he led the company’s R&D efforts and was responsible for corporate and technology strategy; a Hewlett-Packard fellow and vice president and CTO of information optimization responsible for leading HP’s information management and business analytics strategy; founder and CTO of Persist Technologies (acquired by HP), a leader in hyperscale grid storage and archiving solutions whose technology is the basis of the HP Information Archiving Platform IAP; and CTO and senior vice president of strategic corporate development at Zantaz, a leading service provider of information archiving solutions for the financial industry, where he played an instrumental role in the development of the company’s services and raised millions of dollars in capital for international expansion. He has been a board member of DDN since 2007. Jean-Luc studied computer science and electrical engineering in France and business at Emory University’s Goizueta Executive Business School. He is bilingual in French and English and has also studied Russian and classical Greek.

Presentations

Executive Briefing: Using a Domain Knowledge Graph to Manage AI at Scale Session

How do enterprises scale moving beyond one-off AI projects to making it re-usable? Teresa Tung and Jean-Luc Chatelain explain how domain knowledge graphs—the same technology behind today's Internet search—can bring the same democratized experience to enterprise AI. Beyond search applications, we show other applications of knowledge graphs in oil & gas, financial services, and enterprise IT.

Leading Business Intelligence Product Management at Uber.

Previously, a founding team member and Senior Product Manager at ThoughtSpot. Helped build ThoughtSpot from 10 to 300+ people in 5 years. Created the world’s first analytics search engine at ThoughtSpot.

Education: University of California Berkeley, University of Illinois Urbana-Champaign, IIT Guwahati

Presentations

Integrated Business Intelligence Suite at Uber: How we built a platform to convert raw data into knowledge (insights) Session

Our experience building the Business Intelligence platform has been nothing short of extraordinary. This talk details how Uber thought about building its Business Intelligence platform; I’ll narrate the journey of deciding to take a platform approach rather than adding features in a piecemeal fashion.

Dr. Sanjian Chen is a data science researcher with deep knowledge in large-scale machine learning algorithms. He has developed cutting-edge data-driven modeling techniques and autonomous systems in both academic and industry settings. He has designed data-analytics solutions that drove numerous high-impact business decisions for multiple Fortune 500 companies across several industries, including retail, banking, automotive, and telecommunications. He is currently working on building cutting-edge AI engines for high-performance database systems that support large-scale data analytics in multiple business areas. Dr. Chen is a frequent invited speaker at top international conferences, including the Artificial Intelligence Conference (New York), Strata Data Conference (San Francisco), the IEEE Cyber-Physical Systems Week (Chicago), the IFAC conference on Analysis and Design of Hybrid Systems (Atlanta), and IEEE International Conference on Healthcare Informatics (Philadelphia, Dallas)

Dr. Chen received his Ph.D. in Computer and Information Science at the University of Pennsylvania. He received two IEEE Best Paper Awards (IEEE RTSS 2012 and IEEE ISORC 2018). He has published over 25 papers in top journals and conferences, including 2 articles published in the Proceedings of IEEE (IF=9.1). He has served as an invited reviewer for numerous top international journals and conferences, e.g., the IEEE Design & Test, IEEE Transactions on Computers, ACM Transactions on Cyber-Physical Systems, IEEE Transactions on Industrial Electronics, IEEE RTSS conferences, and ACM HSCC conference.

Presentations

Building The Data Infrastructure For The Internet Of Things At Zettabyte-Scale Session

We share the architecture design and many of the detailed technology innovations of Alibaba TSDB, a state-of-the-art database for IoT data management, drawn from years of development and continuous improvement.

Zhiling is an ML engineer at GO-JEK, one of the fastest-growing startups in Asia. She and her colleagues work on scaling machine learning and driving impact throughout the organization. Her focus is on improving the speed at which data scientists iterate, the accuracy and performance of their models, the scalability of the systems they build, and the impact they deliver.

Presentations

Unlocking insights in AI by building a feature store Session

Features are key to driving impact with AI at all scales. By democratizing the creation, discovery, and access of features through a unified platform, organizations are able to dramatically accelerate innovation and time to market. Find out how GOJEK, Indonesia's first billion-dollar startup, unlocked insights in AI by building a feature store called Feast, and the lessons they learned along the way.
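The core idea of a feature store can be sketched in a few lines. The toy class below is a hypothetical illustration, not Feast's actual API: features are ingested once into a shared registry, and training and serving code read them through the same interface, so the two can never drift apart.

```python
import time

class FeatureStore:
    """Toy feature store: one shared place to write and read features."""

    def __init__(self):
        self._store = {}  # (entity_id, feature_name) -> (value, timestamp)

    def ingest(self, entity_id, features):
        """Write a batch of feature values for one entity."""
        ts = time.time()
        for name, value in features.items():
            self._store[(entity_id, name)] = (value, ts)

    def get(self, entity_id, feature_names):
        """Read features; used identically at training and serving time.
        Missing features come back as None rather than raising."""
        return {n: self._store.get((entity_id, n), (None, None))[0]
                for n in feature_names}

store = FeatureStore()
store.ingest("driver_42", {"completed_trips": 1803, "avg_rating": 4.8})
print(store.get("driver_42", ["completed_trips", "avg_rating"]))
```

A production system such as Feast layers much more on top (point-in-time correctness, batch and streaming ingestion, low-latency online serving), but the contract is the same: write features once, read them consistently everywhere.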

Felix Cheung is an engineer at Uber and a PMC and committer for Apache Spark. Felix started his journey in the big data space about five years ago with the then state-of-the-art MapReduce. Since then, he’s (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop distro from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark and ended up contributing to the project. In addition to building stuff, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Presentations

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber Session

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.

Divya Choudhary is a data scientist currently working with GO-JEK, a Jakarta-based technology startup. She is responsible for building algorithms and mathematical models to drive features across GO-JEK's diversified products.

With four years of work experience, Divya is a computer science engineer who has traversed her professional career from analyst to decision scientist to data scientist. The crux of any data science solution lies in having a problem-solving mindset, and Divya has been known for her business acumen and problem-solving approach across all the startups she has been a part of.

Personal:

  • a yoga lover
  • a poetess
  • a painter
  • an avid trekker and wanderer who is best at talking to people and learning about them

Presentations

From random text in addresses to a world-class feature of precise locations using NLP Session

Data scientists around the globe would agree that addresses are among the most unorganised textual data. Structuring addresses has almost become a new stream of NLP in itself. Who would've imagined that address text data could be used to develop one of the coolest product features: finding the most precise pick-up/drop-off locations for e-commerce, logistics, food delivery, or ride/car service companies!

Ira Cohen is a cofounder and chief data scientist at Anodot, where he is responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Sequence-2-Sequence Modeling for Time Series Session

Recently, sequence-to-sequence (S2S) modeling has also been used for applications based on time series data. In this talk, we first give an overview of S2S and its early use cases. We then walk through how S2S modeling can be leveraged for time series use cases such as real-time anomaly detection and forecasting.

Robert Cohen is a senior fellow at the Economic Strategy Institute, where he is directing a new study to examine the economic and business impacts of machine learning and AI on firms and the U.S. economy.

Presentations

Data-driven digital transformation and Jobs: the New Software Hierarchy and ML Session

This talk describes the skills that employers are seeking for digital jobs, linked to the new software hierarchy driving digital transformation. We describe this software hierarchy as one that ranges from DevOps, CI/CD, and microservices to Kubernetes and Istio. This hierarchy is used to define the jobs that are central to data-driven digital transformation.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, he was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow 2-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow (Day 2) Training Day 2

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Presentations

Why is it so hard to do AI for Good? Session

DataKind UK has been working in data for good since 2013, partnering with over 100 UK charities to help them do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. In this session, Duncan and Giselle will talk about how to identify the right data-for-good projects...

Lidia Crespo leads the Big Data Governance activities within the CDO team. She and her team have been instrumental in the adoption of the technology platform, creating a sense of trust through their deep knowledge of the organisation's data. With her experience in complex and challenging international projects and an audit, IT, and data background, Lidia brings a combination that is difficult to find.

Presentations

The vindication of big data: How Hadoop is used at Santander UK to defend privacy Session

Big data is usually regarded as a menace to data privacy. However, with the right principles and mindset, it can be a game changer that puts customers first and treats data privacy as an inalienable right. Santander UK applied this model to comply with GDPR, using graph technology, Hadoop, Spark, and Kudu to drive data obscuring and data portability, and to drive machine learning exploration.

Samuel Cristobal holds an MSc in Advanced Mathematics and Applications (Universidad Autónoma de Madrid), a BSc (with honors) in Mathematics (Universidad Complutense de Madrid), and a BEng (valedictorian) in Telecommunication Systems (Universidad Politécnica de Madrid), and was a research associate fellow at the University of Vienna, where he worked on mathematical research with a focus on algebraic geometry, logic, and computer science.

Samuel has been a researcher at Innaxis for ten years, during which he successfully executed more than a dozen data science projects in the field of aviation, ranging from mobility to safety, mostly as the technical or scientific coordinator. Currently, Samuel is the Science and Technology Director at Innaxis, managing the institute's research agenda.

Presentations

Machine Learning in aviation is finally taking off DCS

DataBeacon is a multi-sided data and machine learning platform for the aviation industry. Two applications will be presented: SmartRunway, a machine learning solution for runway optimisation, and SafeOperations, predictive analytics for operational safety.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Data Case Studies Welcome Tutorial

Welcome to the Data Case Studies tutorial.

Thursday keynote welcome Keynote

Program Chairs, Ben Lorica, Doug Cutting, and Alistair Croll, welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program Chairs, Ben Lorica, Alistair Croll, and Doug Cutting, welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynote welcome Keynote

Program Chairs, Ben Lorica, Doug Cutting, and Alistair Croll, welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program Chairs, Ben Lorica, Alistair Croll, and Doug Cutting, welcome you to the first day of keynotes.

Ivan Luciano Danesi has been working at UniCredit Services S.C.p.A. since July 2014 as a data scientist. His activities have mainly focused on risk management and customer relationship management within the big data framework. He holds a PhD in Statistics from the University of Padua, during which he carried out research in collaboration with the Università di Trieste and CASS Business School (London). He is a teaching assistant and research collaborator in the Department of Statistics at Università Cattolica del Sacro Cuore (Milan).

Presentations

A Big Data Customer Relationship Management Case Study for Banking Findata

This use case describes the construction of a customer relationship management application in a big data environment. More than 50 models, refreshed monthly, have been deployed within this project. The aim of each model is customer development (cross-selling) or attrition management (churn). The input for the analysis is heterogeneous data coming from seven different countries.

Ifi Derekli is a senior solutions engineer at Cloudera, focusing on helping large enterprises solve big data problems using Hadoop technologies. Her subject matter expertise is in security and governance, which are crucial components of every successful production big data use case. Prior to Cloudera, Ifi was a presales technical consultant at Hewlett-Packard Enterprise, where she provided technical expertise for Vertica and IDOL (currently part of Micro Focus). She holds a BS in Electrical Engineering and Computer Science from Yale University.

Presentations

Getting ready for GDPR and CCPA: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to CCPA.

Apurva joined Google over three years ago. He leads the Dataproc, Composer, and CDAP products on the Data Analytics team. Before Google, Apurva spent a year at Lenovo/Motorola leading their Mobile Cloud team. Before that, he spent 3.5 years at Pivotal Software, where he built and commercialized Pivotal's Hadoop distribution, and six years at Yahoo leading various search and display advertising efforts as well as the Hadoop solutions team. He holds a master's degree in EE from Simon Fraser University, B.C., Canada.

Presentations

Migrating Apache Oozie Workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, with the former focused on Apache Hadoop jobs. We see a need for an Oozie-to-Airflow workflow mapping as part of creating an effective cross-cloud/cross-system solution. This talk introduces an open source Oozie-to-Airflow migration tool developed at Google.

Wolff Dobson is a developer programs engineer at Google specializing in machine learning and games. Previously, he worked as a game developer, where his projects included writing AI for the NBA 2K series and helping design the Wii Motion Plus. Wolff holds a PhD in artificial intelligence from Northwestern University.

Presentations

TensorFlow For Everyone Session

In this talk, we will cover the latest in TensorFlow, both for beginners and for developers migrating from 1.x to 2.0. We'll cover the best ways to set up your model, feed your data into it, and distribute it for fast training. We'll also look at how TensorFlow has recently been upgraded to be more intuitive.

David Dogon was born in Cape Town, South Africa, the same city where he completed a bachelor's degree in chemical engineering. Being a bit of an adventurer, he moved to New York to pursue a master's degree in the same field at Columbia University. In 2012 he moved to the Netherlands to perform research towards a PhD in mechanical engineering at TU Eindhoven. Always driven by an interest in the insights and predictive power of data, he shifted to the broad field of data science in 2016. As a data scientist he has worked primarily in financial services. He joined Van Lanschot Kempen in 2018 as part of the data science team, where he focuses primarily on investments and asset management.

Presentations

Using Machine Learning for Stock Picking Session

In this talk we describe how machine learning revolutionized the stock-picking process for portfolio managers at Kempen Capital Management by filtering the vast small-cap investment universe down to a handful of optimal stocks.

Mark Donsky leads product management at Okera, a software company that provides discovery, access control, and governance at scale for today's modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera. Mark has held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive briefing: big data in the era of heavy worldwide privacy regulations Session

The General Data Protection Regulation (GDPR) went into place on May 25, 2018, for all organizations—both EU and non-EU—that offer services to EU residents, as well as anyone who controls or processes data within the EU. The State of California is following suit with its California Consumer Protection Act (CCPA), targeted to be enforced in 2020.

Getting ready for GDPR and CCPA: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to CCPA.

Ted Dunning is chief application architect at MapR. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects, and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He has contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Report Card on Streaming Microservices Session

As a community, we have been pushing streaming architectures, particularly microservices, for several years now. But what are the results in the field? I will present several anonymized case histories and describe the good, the bad, and the ugly. In particular, I will describe how several teams who were new to big data fared by skipping MapReduce and jumping straight into streaming.

Ananth Packkildurai is a senior data engineer at Slack, where he manages core data infrastructure such as Airflow, Kafka, Flink, and Pinot. He is passionate about all things related to ethical data management and data engineering.

Presentations

Reliable logging infrastructure @ Slack Session

Logs are everywhere. Every organization collects tons of data every day. Logs are only as good as the trust they earn for making business-critical decisions, so building trust in logs and keeping them reliable is critical to creating a data-driven organization. Ananth walks through his experience building reliable logging infrastructure at Slack and how it helped build confidence in the data.

Maren is a Principal Data Scientist at QuantumBlack. She leads the analytics work on client projects, working across industries on predictive, explanatory and optimisation problems. Her role includes defining the analytical approach, developing the code base, building models and communicating the results.

Maren also leads the technical training programme for QuantumBlack’s Data Science team, arranges bespoke trainings, seminars, and conference attendance.

Before joining QB, Maren worked in demand forecasting and completed a PhD in probability theory at the University of Bath.

Presentations

Opening the black-box: Explainable AI (XAI) Session

The success of machine learning algorithms in a wide range of domains has led to a desire to leverage their power in ever more areas. In this session, we discuss modern explainability techniques which increase the transparency of black-box algorithms, drive adoption and help manage ethical, legal and business risks. Many of these methods can be applied to any model without limiting performance.

Yoav drives product management, technology vision, and go-to-market activities for GigaSpaces. Prior to joining GigaSpaces, Yoav held various leading product management roles at Iguazio and Qwilt, mapping product strategy and roadmaps while providing technical leadership on architecture and implementation. Yoav brings more than 12 years of product management and software engineering experience from high-growth software companies. An entrepreneur at heart, Yoav drives innovation and product excellence and successfully aligns them with market trends and business needs. Yoav holds a BSc in Computer Science and Business (magna cum laude) from Tel Aviv University and an MBA in Finance from the Leon Recanati School at Tel Aviv University.

Presentations

A Deep Learning Approach to Automatic Call Routing Session

Technological advancements are transforming customer experience, and businesses are beginning to benefit from deep learning innovations that automate call center routing to the most appropriate agent. This session will discuss how deep learning models can be run with the Intel BigDL and Spark frameworks, co-located on an in-memory computing platform, to enhance the customer experience without the need for GPUs.

How NLP Is Helping a European Financial Institution Enhance Customer Experience Findata

How a leading IT service provider for financial firms leverages NLP to help service agents provide first-call resolution quickly and efficiently, enhancing CX and reducing the time agents spend on the line, which lowers operational costs. The system responds with sub-second latency and creates continuously learning models based on each transaction, ensuring updated models for smarter, faster insights.

As CTO, Geir leads the R&D department in developing the Cognite industrial IoT data platform. Geir was founder and CEO/CTO at Snapsale, a machine learning classifieds startup that was acquired by Schibsted. Prior to this, he worked for three years as a senior software engineer at Google in Canada, where he worked on machine learning for AdWords and AdSense, resulting in the Conversion Optimizer product. Geir has an MSc in computational science from the University of Oslo and won a silver medal at the International Olympiad in Informatics.

Presentations

Processing 10M samples/second to drive smart maintenance in complex IIoT systems Session

Learn how Cognite is developing IIoT smart maintenance systems that can process 10M samples/second from thousands of sensors. We’ll review an architecture designed for high-performance, robust streaming sensor data ingest and cost-effective storage of large volumes of time series data, best practices for aggregation and fast queries, and achieving high performance with machine learning.

Moty Fania is a principal engineer for big data analytics at Intel IT and the CTO of the Advanced Analytics Group, which delivers big data and AI solutions across Intel. With over 15 years of experience in analytics, data warehousing, and decision support solutions, Moty leads the development and architecture of various big data and AI initiatives, such as IoT systems, predictive engines, online inference systems, and more. Moty holds a bachelor’s degree in economics and computer science and a master’s degree in business administration from Ben-Gurion University.

Presentations

Building a Sales AI platform – key principles and lessons learned Session

In this session, Moty Fania will share his experience implementing a Sales AI platform that processes millions of website pages and sifts through millions of tweets per day. The platform is based on unique open source technologies and was designed for real-time data extraction and actuation. This session highlights the key learnings with a thorough review of the architecture.

Fabio Ferraretto is the Lead for Data Science for Accenture in Latin America, where he manages 215 creative, innovative, and enthusiastic data scientists. He applies advanced analytics, optimization, combinatorial math, predictive modeling, and artificial intelligence to solve complex and challenging business problems for clients in healthcare, telecom, CPG, mining, and other industries. He has a degree in civil engineering from Escola Politécnica at USP and has worked at Accenture since 2002 applying analytics to business challenges. He led the Gapso Analytics acquisition in 2015 and its integration with Accenture.

Presentations

Insightful Health - Amplifying Intelligence in Healthcare Patient Flow Execution Session

How Albert Einstein and Accenture evolved patient flow experience and efficiency with applied AI, statistics, and combinatorial math, allowing the hospital to anticipate E2E visibility within patient flow operations, from admission of emergency and elective demands to assignment and medical releases.

Ilan Filonenko is a member of the Data Science Infrastructure team at Bloomberg, where he has designed and implemented distributed systems at both the application and infrastructure levels. He is one of the principal contributors to Spark on Kubernetes, primarily focusing on the effort to enable secure HDFS interaction and non-JVM support. Previously, Ilan was an engineering consultant and technical lead at various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan currently researches algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on optimizing stochastic algorithms and model management.

Presentations

Cross-Cloud Model Training and Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We'll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud, then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.

Piotr Findeisen is a Software Engineer and a founding member of the Starburst team. He contributes to the Presto code base and is also active in the community. Piotr has been involved in the design and development of significant features like the cost-based optimizer (still in development), spill to disk, correlated subqueries and a plethora of smaller enhancements.

Before Starburst, Piotr worked at Teradata and was the top external Presto committer of the year. Prior to that, he was a Team Leader at Syncron (provider of cloud services for supply chain management), responsible for their product’s technical foundation and performance.

Piotr holds an M.S. in Computer Science (and a B.Sc. in Mathematics) from the University of Warsaw.

Presentations

Presto: Cost-Based Optimizer for interactive SQL-on-Anything Session

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3/Azure ADLS, RDBMSs, NoSQL, etc.). Recently, Starburst contributed a cost-based optimizer to Presto, which brings a great performance boost. Learn about the CBO's internals, the motivating use cases, and the observed improvements.

Marcel has worked as a software engineer on the Wikimedia Foundation's analytics team since October 2014. He believes it's a privilege to be able to professionally contribute to Wikipedia and the free knowledge movement. He's also worked on quite disparate things such as recommender systems, serious games, natural language processing and… selling hand-painted t-shirts on the beach of Natal, Brazil.

Presentations

The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit Session

Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data for more than 90 days. The Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both.

Don Fox is a Boston-based data scientist in residence at the Data Incubator. Previously, Don developed numerical models for a geothermal energy startup. Born and raised in South Texas, Don holds a PhD in chemical engineering, where he researched renewable energy systems and developed computational tools to analyze the performance of these systems.

Presentations

Hands-On Data Science with Python 2-Day Training

We will walk through all the steps, from prototyping to production, of developing a machine learning pipeline. We’ll look at data cleaning, feature engineering, model building/evaluation, and deployment. Students will extend these models into two applications using real-world datasets. All work will be done in Python.

Hands-On Data Science with Python (Day 2) Training Day 2

We will walk through all the steps, from prototyping to production, of developing a machine learning pipeline. We’ll look at data cleaning, feature engineering, model building/evaluation, and deployment. Students will extend these models into two applications using real-world datasets. All work will be done in Python.

Michael J. Freedman is the co-founder and CTO of Timescale, as well as a full professor of computer science at Princeton University. He did his PhD at NYU and Stanford, and his undergraduate and master’s degrees at MIT.

Freedman’s work broadly focuses on distributed and storage systems, networking, and security, with publications having more than 12,000 citations. He developed CoralCDN (a decentralized content distribution network serving millions of daily users) and helped design Ethane (which formed the basis for OpenFlow / software-defined networking). He previously co-founded Illuminics Systems (acquired by Quova, now part of Neustar) and serves as a technical advisor to Blockstack.

Honors include a Presidential Early Career Award for Scientists and Engineers (presented by President Obama), the SIGCOMM Test of Time Award, a Sloan Fellowship, an NSF CAREER award, an Office of Naval Research Young Investigator award, and the DARPA Computer Science Study Group.

Presentations

Performant time-series data management and analytics with Postgres Session

The requirements of time-series databases include ingesting high volumes of structured data; answering complex, performant queries over both recent and historical time intervals; and performing specialized time-centric analysis and data management. I explain how one can meet these requirements, while avoiding the usual operational problems, by re-engineering Postgres to serve as a general data platform, including for high-volume time-series workloads.

Brandon Freeman is a Mid-Atlantic region strategic system engineer at Cloudera, specializing in infrastructure, the cloud, and Hadoop. Previously, Brandon was an infrastructure architect at Explorys, working in operations, architecture, and performance optimization for the Cloudera Hadoop environments, where he was responsible for designing, building, and managing many large Hadoop clusters.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges, from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-prem and in the cloud. First we’ll cover cloud architecture and challenges in depth; then you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Michael Freeman is a Senior Lecturer at the Information School at the University of Washington, where he teaches courses on data science, data visualization, and web development. With a background in public health, Michael works alongside research teams to design and build interactive data visualizations to explore and communicate complex relationships in large datasets. Previously, he was a data visualization specialist and research fellow at the Institute for Health Metrics and Evaluation, where he performed quantitative global health research and built a variety of interactive visualization systems to help researchers and the public explore global health trends. Michael is interested in applications of data visualization to social change. He holds a master’s degree in public health from the University of Washington. You can take a look at samples from his projects on his website.

Presentations

Visually Communicating Statistical and Machine Learning Methods Session

Statistical and machine learning techniques are only useful when they're understood by decision makers. While implementing these techniques is easier than ever, communicating about their assumptions and mechanics is not. In this session, participants will learn a design process for crafting visual explanations of analytical techniques and communicating them to stakeholders.

Brandy Freitas is a principal data scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a research physicist-turned-data scientist based in Boston, MA. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryo-electron microscopy data. Brandy is a National Science Foundation Graduate Research Fellow and a James Mills Pierce Fellow. She holds an undergraduate degree in physics and chemistry from the Rochester Institute of Technology and did her graduate work in biophysics at Harvard University.

Presentations

Executive Briefing: Analytics for Executives Session

Data science is an approachable field given the right framing. Often, though, practitioners and executives describe opportunities using completely different languages. In this session, Harvard biophysicist-turned-data scientist Brandy Freitas will work with participants to develop context and vocabulary around data science topics to help build a culture of data within their organizations.

Ellen Friedman is Principal Technologist for MapR Technologies. She is a committer on the Apache Drill and Apache Mahout projects and coauthor of a number of short books on big data topics including AI and Analytics in Production, Machine Learning Logistics, Streaming Architecture, the Practical Machine Learning series, and Introduction to Apache Flink. Ellen has been an invited speaker at Strata Data conferences, Big Data London, Berlin Buzzwords, Nike Tech Talks, the University of Sheffield Methods Institute and NoSQL Matters Barcelona. She holds a PhD in biochemistry.

Presentations

Executive Briefing: 5 Things Every Executive Should *Not* Know Session

A surprising fact of modern technology is that not knowing some things can make you better at what you do. This isn’t just a lack of distraction or being too delicate to face reality. It’s about separation of concerns, with a techno flavor. In this talk, I go through five things that best practices with emerging technologies and new architectures give us ways to not know, and why that’s important.

Matt Fuller is cofounder at Starburst, the Presto company. Matt has held engineering roles in the data warehousing and analytics space for the past 10 years. Previously, he was director of engineering at Teradata, leading engineering teams working on Presto, and was part of the team that led the initiative to bring open source, in particular Presto, to Teradata’s products. Before that, Matt architected and led development efforts for the next-generation distributed SQL engine at Hadapt (acquired by Teradata in 2014) and was an early engineer at Vertica Systems (acquired by HP), where he worked on the Query Optimizer.

Presentations

Learning Presto: SQL-on-Anything Tutorial

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL-on-Anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from gigabytes to petabytes. In this tutorial, attendees will learn Presto usage and best practices, with optional hands-on exercises.

Marina Rose Geldard, more commonly known as Mars, is a technologist from Down Under in Tasmania. Entering the world of technology relatively late as a mature-age student, she has found her place in the world: an industry where she can apply her lifelong love of mathematics and optimization. When she is not busy being the most annoyingly eager student ever, she compulsively volunteers at industry events, dabbles in research, and serves on the executive committee of her state’s branch of the Australian Computer Society (ACS) as well as the AUC (http://auc.edu.au). She is currently writing ‘Practical Artificial Intelligence with Swift’ for O’Reilly Media and working on machine learning projects to improve public safety through public CCTV cameras in her hometown of Hobart.

Presentations

Science-Fictional User Interfaces Session

Science-fiction has been showcasing complex, AI-driven (often AR or VR) interfaces (for huge amounts of data!) for decades. As television, movies, and video games became more capable of visualising a possible future, the grandeur of these imagined science fictional interfaces has increased. What can we learn from Hollywood UX? Is there a useful takeaway? Does sci-fi show the future of AI UX?

Oliver Gindele is Head of Machine Learning at Datatonic. He studied materials science at ETH Zurich and moved to London to obtain his PhD in computational physics from UCL. Oliver is passionate about using computer models to solve real-world problems, which led him to join Datatonic to create bespoke machine learning solutions. Working with clients in retail, finance, and telecommunications, Oliver applies deep learning techniques to tackle some of the most challenging use cases in these industries.

Presentations

Deep Learning for Recommender Systems Session

The success of deep learning has reached the realm of structured data in the past few years, where neural networks have been shown to improve the effectiveness and predictability of recommendation engines. This session gives a brief overview of such deep recommender systems and how they can be implemented in TensorFlow.
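
The core idea behind such systems, representing users and items as learned vectors whose dot product predicts a rating, can be sketched in a few lines. The following is a hypothetical miniature in plain Python rather than TensorFlow, with a plain dot product standing in for the deeper, nonlinear towers a real system would use; all names and data are invented for illustration:

```python
import random

# Users and items each get a dense embedding vector; the predicted
# rating is their dot product, trained by stochastic gradient descent.
random.seed(0)
DIM = 4
users = {u: [random.gauss(0, 0.1) for _ in range(DIM)] for u in range(3)}
items = {i: [random.gauss(0, 0.1) for _ in range(DIM)] for i in range(3)}

# Observed (user, item, rating) interactions
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 1, 5.0), (2, 2, 4.0)]

def predict(u, i):
    return sum(a * b for a, b in zip(users[u], items[i]))

for _ in range(1000):  # SGD on squared error
    for u, i, r in ratings:
        err = predict(u, i) - r
        for d in range(DIM):
            gu, gi = err * items[i][d], err * users[u][d]
            users[u][d] -= 0.05 * gu
            items[i][d] -= 0.05 * gi

print(round(predict(0, 0), 1))  # close to the observed 5.0 after training
```

A deep recommender replaces the dot product with learned nonlinear interactions, but the embedding-lookup-and-train loop is the same in spirit.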

Emily has over ten years of experience in scientific computing and engineering research and development. She has a background in mathematical analysis, with a focus on probability theory and numerical analysis. She is currently working in Python development, though she has a background that includes C#/.NET, Unity3D, SQL, and MATLAB. In addition, she has experience in statistics and experimental design, and has served as principal investigator in clinical research projects.

Presentations

Continuous Intelligence: Keeping your AI Application in Production Session

Machine learning can be challenging to deploy and maintain. Data changes, and both models and the systems that implement them must be able to adapt. Any delay moving models from research to production means leaving your data scientists' best work on the table. In this talk, we explore continuous delivery (CD) for AI/ML and examine case studies for applying CD principles to data science workflows.

Ever since completing her studies, Caroline has nurtured a passion for how information can be expressed, shared, and understood. In 2010, sensing that the rich-data era would transform the way we work, learn, and communicate, she cofounded Dataveyes, a studio specialized in human-data interactions. At Dataveyes, she translates data into interactive experiences in order to reveal new insightful stories, support new uses, and understand our environment shaped by data and algorithms.

Presentations

When you don’t really know what to do with this huge pile of strategic data DCS

This case study shows how to leverage data to rethink the way an industry builds its offer, improves its customer experience, and feeds its forward-looking thinking. We will focus on the approach, the methodology, and the results achieved for a leading public transport operator.

Sonal is the founder and CEO of Nube Technologies, a startup focused on big data preparation and analytics. Nube Technologies builds business applications for better decision making through better data. Nube’s fuzzy matching product Reifier helps companies get a holistic view of enterprise data. By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, promotes enhanced security and risk management, and enables better consolidation and reporting of business data. Nube helps its customers build better and more effective models by ensuring that their underlying master data is accurate.

Presentations

Mastering Data with Spark and Machine Learning Session

Enterprise data on customers, vendors, products, etc., is siloed and represented differently in diverse systems, hurting analytics, compliance, regulatory reporting, and 360-degree views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. This talk covers a modern master data application using Spark, Cassandra, ML, and Elastic.

Trevor Grant is a committer on Apache Mahout, a contributor to Apache Streams (incubating), Apache Zeppelin, and Apache Flink, and an open source technical evangelist at IBM. In former roles he called himself a data scientist, but the term is so overused these days. He holds an MS in applied math and an MBA from Illinois State University. Trevor is an organizer of the newly formed Chicago Apache Flink Meetup and has presented at Flink Forward, ApacheCon, Apache Big Data, and other meetups nationwide.

Trevor was a combat medic in Afghanistan in 2009 and wrote an award-winning undergraduate thesis between missions. He has a dog, a cat, and a ’64 Ford, and he loves them all very much.

Presentations

Cross-Cloud Model Training and Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We’ll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.

Jay is a final year student at King’s College London studying Computer Science. She joined Hotels.com in the Big Data Platform team for her industrial placement year where she spent time working with Apache Hive, modularization techniques for SQL, and mutation testing tools.

Presentations

Mutant Tests Too: The SQL Session

Hotels.com describe approaches for applying software engineering best practices to SQL-based data applications in order to improve maintainability and data quality. Using open source tools, we show how to build effective test suites for Apache Hive code bases. We also present Mutant Swarm, a mutation testing tool we’ve developed to identify weaknesses in tests and to measure SQL code coverage.
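
The mutation-testing idea behind tools like Mutant Swarm can be illustrated in miniature: generate small mutants of a SQL string, run the test suite against each one, and flag mutants that survive, since a surviving mutant means no test noticed the change. This sketch is hypothetical (the query, mutation rules, and tests are invented) and is not Mutant Swarm itself:

```python
# Each mutation swaps one SQL fragment for a subtly different one.
MUTATIONS = [(" > ", " >= "), (" SUM(", " MAX("), (" AND ", " OR ")]

def mutants(sql):
    for old, new in MUTATIONS:
        if old in sql:
            yield sql.replace(old, new, 1)

def surviving_mutants(sql, test_suite):
    """Mutants that pass every test are not being caught by the suite."""
    return [m for m in mutants(sql) if all(t(m) for t in test_suite)]

query = "SELECT SUM(price) FROM bookings WHERE nights > 1 AND price > 0"

# A deliberately weak suite: it never inspects the query's results.
weak_suite = [lambda q: "price" in q, lambda q: q.startswith("SELECT")]

print(len(surviving_mutants(query, weak_suite)))  # all three mutants survive
```

A real tool would execute each mutant against Hive with the project's actual tests; the survivors point directly at assertions that are missing.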

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Lyft Data Platform, Now and in the Future Session

Lyft’s data platform is at the heart of Lyft’s business. Decisions all the way from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. In this talk, Mark Grover walks through various choices Lyft has made in developing and sustaining the data platform, and why, along with what lies ahead.

Nischal HP is currently the VP of Engineering at omnius, a Berlin-based AI startup building AI products for the insurance industry. Previously, he was a cofounder and data scientist at Unnati Data Labs, where he built end-to-end data science systems in fintech, marketing analytics, event management, and the medical domain. Nischal is also a data science mentor at Springboard. During his tenure at companies like Redmart and SAP, he architected and built software for ecommerce systems covering catalog management, recommendation engines, sentiment analyzers, data crawling frameworks, intention mining systems, and gamification of technical indicators for algorithmic trading platforms. Nischal has conducted workshops in the field of deep learning and has spoken at a number of data science conferences, including O’Reilly Strata San Jose 2017, PyData London 2016, PyCon Czech Republic 2015, Fifth Elephant India (2015 and 2016), and Anthill Bangalore 2016. He is a strong believer in open source and loves to architect big, fast, and reliable AI systems. In his free time, he enjoys traveling with his significant other, music, and grokking the web.

Presentations

Deep Learning for Fonts Session

Deep learning has enabled massive breakthroughs on offbeat tracks and a better understanding of how an artist paints, how an artist composes music, and so on. As part of Nischal and Raghotham’s beloved project, Deep Learning for Humans, they want to build a font classifier and show the masses how fonts can be classified and how and why two or more fonts are similar.

Christian Hidber has a PhD in computer algebra from ETH Zurich. He did a postdoc at UC Berkeley, where he researched online data mining algorithms. Currently he applies machine learning to industrial hydraulics simulation as part of a product with 7,000 installations in 42 countries.

Presentations

Reinforcement Learning: a Gentle Introduction & Industrial Application Session

Reinforcement learning (RL) learns complex processes autonomously, like walking, beating the world champion in Go, or flying a helicopter. No big datasets with the “right” answers are needed: the algorithms learn by experimenting. We show how and why RL works in an intuitive fashion and highlight how to apply it to an industrial hydraulics application with 7,000 clients in 42 countries.

Mark Hinely, Esq., is Director of Regulatory Compliance at KirkpatrickPrice and a member of the Florida Bar, with 10 years of experience in data privacy, regulatory affairs, and internal regulatory compliance. His specific experiences include performing mock regulatory audits, creating vendor compliance programs and providing compliance consulting. He is also SANS certified in the Law of Data Security and Investigations.

As GDPR has become a revolutionary data privacy law around the world, Mark has become the resident GDPR expert at KirkpatrickPrice. He has led the GDPR charge through internal training, developing free, educational content, and performing gap analyses, assessments, and consulting services for organizations of all sizes.

Presentations

The Future of Data Privacy Law: It’s Getting Personal Session

Organizations across the globe are trying to determine whether GDPR applies to them. Now, it seems as though GDPR principles are headed to the US. In 2018 alone, more than ten states passed or amended consumer privacy and breach notification laws. Mark Hinely will provide insight on current and future data privacy laws in the US and how they will impact organizations across the globe.

Ana Hocevar obtained her PhD in Physics before becoming a postdoctoral fellow at the Rockefeller University where she worked on developing and implementing an underwater touchscreen for dolphins. She has over 10 years of experience in physics and neuroscience research and over 5 years of teaching experience. Now she combines her love for coding and teaching as a Data Scientist in Residence at The Data Incubator.

Presentations

Machine Learning from Scratch in TensorFlow 2-Day Training

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. This training will introduce TensorFlow's capabilities in Python. It will move from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow with several hands-on applications.

Machine Learning from Scratch in TensorFlow (Day 2) Training Day 2

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. This training will introduce TensorFlow's capabilities in Python. It will move from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow with several hands-on applications.
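
The "piece by piece" starting point of such a training can be illustrated without any framework at all: a single sigmoid neuron fit with hand-written gradient descent. This is a hypothetical sketch on an invented AND-style dataset, not the course material itself; the course builds the equivalent up in TensorFlow and then in Keras.

```python
import math, random

random.seed(0)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [random.uniform(-1, 1), random.uniform(-1, 1)]
b = 0.0

def forward(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1 / (1 + math.exp(-z))     # sigmoid activation

for _ in range(2000):                 # gradient descent on cross-entropy loss
    for x, y in data:
        p = forward(x)
        grad = p - y                  # d(loss)/dz for sigmoid + cross-entropy
        w[0] -= 0.5 * grad * x[0]
        w[1] -= 0.5 * grad * x[1]
        b -= 0.5 * grad

print([round(forward(x)) for x, _ in data])  # learns AND: [0, 0, 0, 1]
```

Once the mechanics are clear at this level, a high-level API like Keras is essentially this loop with layers, optimizers, and automatic differentiation handled for you.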

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they could never before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

Protecting sensitive data in huge datasets: Cloud tools you can use Session

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. We will explore massive public datasets, taking you from theory to real life and showcasing newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm (with options such as removing, masking, and coarsening).
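
The k-anonymity concept mentioned in the abstract is easy to state: a release is k-anonymous if every combination of quasi-identifier values is shared by at least k records. A minimal sketch with invented records, also showing the effect of coarsening (one of the options the abstract names):

```python
from collections import Counter

def k_anonymity(records, quasi_ids):
    """Smallest group size over all quasi-identifier combinations."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(groups.values())

rows = [
    {"zip": "SW1", "age": "30-39", "diagnosis": "flu"},
    {"zip": "SW1", "age": "30-39", "diagnosis": "cold"},
    {"zip": "SW1", "age": "40-49", "diagnosis": "flu"},
]

print(k_anonymity(rows, ["zip", "age"]))  # 1: the 40-49 row is unique

# Coarsening the age column merges the unique row into a larger group.
for r in rows:
    r["age"] = "30-49"
print(k_anonymity(rows, ["zip", "age"]))  # now 3
```

Real tools automate exactly this kind of measurement at dataset scale and add protections such as l-diversity that k-anonymity alone does not give.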

Matthew Honnibal is the creator and lead developer of spaCy, one of the most popular libraries for Natural Language Processing. He has been publishing research on NLP since 2005, with a focus on syntactic parsing and other structured prediction problems. He left academia to start working on spaCy in 2014.

Presentations

Agile NLP workflows with spaCy and Prodigy Session

In this talk, I'll discuss "one weird trick" that can give your NLP project a better chance of success. The advice is this: avoid a "waterfall" methodology where data definition, corpus construction, modelling and deployment are performed as separate phases of work.

Christopher Hooi is the Deputy Director of Communications & Sensors at the Land Transport Authority of Singapore. He is passionate about harnessing big data innovations to address complex land transport issues. Since 2010, he has pursued a long-term digital strategy with the main aim of achieving smart urban mobility in a fast-changing digital world. Central to this strategy is building and sustaining a land transport digital ecosystem through an extensive network of sensor feeds, analytical processes, and commuter outreach channels, synergistically put together to deliver a people-centred land transport system.

Presentations

Early Incident Detection using Fusion Analytics of Commuter-Centric Data Sources Session

The Fusion Analytics for Public Transport Event Response (FASTER) system provides a real-time advanced analytics solution for early warning of potential train incidents. Using novel fusion analytics of multiple data sources, FASTER harnesses the use of engineering and commuter-centric IoT data sources to activate contingency plans at the earliest possible time and reduce impact to commuters.

He’s one of Dataiku’s top data scientists, but Alexandre Hubert began his career in a very different domain. After four years as a trader in the City, he realised that, with the huge amount of data out there, it was possible – and fun! – to solve problems using real-life data. Since becoming a data scientist, Alexandre has worked on a range of use cases, from creating models that predict fraud to building specific recommendation systems. He especially loves using deep learning with text or sports data. Even when he’s playing sport or having fun with friends, Alexandre sees numbers and patterns everywhere, bringing him quickly back to his laptop to try out new ideas.

Presentations

Improving Infrastructure Efficiency with Unsupervised Algorithms Session

GRDF helps bring natural gas to nearly 11 million customers every day. In partnership with GRDF, Dataiku worked to optimise the manual process of qualifying addresses to visit and ultimately save GRDF time and money. The solution was the culmination of a year-long adventure in the land of maintenance experts, legacy IT systems, and agile development.

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Solving data cleaning and unification using human-guided machine learning Session

Last year, we covered two primary challenges in applying machine learning to data curation: entity consolidation and using probabilistic inference to suggest repairs for identified errors and anomalies. This year, we'll cover these challenges in greater detail and explain why data unification projects commonly come to require human-guided machine learning and a probabilistic model.

Rashed Iqbal is the cofounder of Narrative Economics. He works as chief technology officer for a government investment fund in the UAE. Earlier, he worked for Teledyne Technologies, Western Digital, Synopsys, and others in the US in technology and management roles.

Rashed teaches graduate courses in Data Science in the Economics Department at UCLA. He also teaches at UC Irvine, and in UCLA Extension. His areas of interest include Text Analytics, Natural Language Understanding, and Lean and Agile Development. Rashed has led multiple entrepreneurial ventures in Data Science. He holds a Ph.D. in Systems Engineering from the University of Sheffield, UK. He believes Narrative Modeling will revolutionize the process of human communication.

Presentations

Modeling the Tesla Narrative Findata

Despite fierce challenges, Tesla has upended not only the automotive and technology sectors but also our perception of disruption itself. Tesla and its enigmatic CEO, Elon Musk, have consistently used narratives to support their brand and market valuation. This talk presents a case study in the application of Narrative Modeling to the news and social media content about Tesla since its inception.

Amir Issaei is a Data Science Consultant at Databricks. He educates customers on how to leverage Databricks’ Unified Analytics Platform in Machine Learning (ML) projects. He also helps customers to implement ML solutions and to use Advanced Analytics to solve business problems. Before joining Databricks, he worked at American Airlines’ Operations Research department, where he supported Customer Planning, Airport and Customer Analytics groups. He received an MS in Mathematics from University of Waterloo and a Bachelor of Engineering Physics from University of British Columbia.

Presentations

Large-Scale ML with MLflow, Deep Learning and Apache Spark 2-Day Training

The course covers the fundamentals of neural networks and how to build distributed Keras/TensorFlow models on top of Spark DataFrames. Throughout the class, you will use Keras, TensorFlow, Deep Learning Pipelines, and Horovod to build and tune models. You will also use MLflow to track experiments and manage the machine learning lifecycle. NOTE: This course is taught entirely in Python.

Large-Scale ML with MLflow, Deep Learning and Apache Spark (Day 2) Training Day 2

The course covers the fundamentals of neural networks and how to build distributed Keras/TensorFlow models on top of Spark DataFrames. Throughout the class, you will use Keras, TensorFlow, Deep Learning Pipelines, and Horovod to build and tune models. You will also use MLflow to track experiments and manage the machine learning lifecycle. NOTE: This course is taught entirely in Python.

Maryam Jahanshahi is a research scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD from the Icahn School of Medicine at Mount Sinai, where she studied molecular regulators of organ size control. Maryam’s long-term research goal is to reduce bias in decision making by using a combination of computation linguistics, machine learning, and behavioral economics methods.

Presentations

The Evolution of Data Science Skill Sets: An analysis using Exponential Family Embeddings Session

In this talk, I will discuss exponential family embeddings, methods that extend the idea behind word embeddings to other data types. I will describe how we used dynamic embeddings to understand how data science skill sets have transformed over the last three years, using our large corpus of job descriptions. The key takeaway is that these models can enrich the analysis of specialized datasets.

Alejandro (Alex) Jaimes is Senior Vice President of AI and data science at Dataminr. His work focuses on mixing qualitative and quantitative methods to gain insights on user behavior for product innovation. Alex is a scientist and innovator with 15+ years of international experience in research leading to product impact at companies including Yahoo, KAIST, Telefónica, IDIAP-EPFL, Fuji Xerox, IBM, Siemens, and AT&T Bell Labs. Previously, Alex was head of R&D at DigitalOcean, CTO at AiCure, and director of research and video products at Yahoo, where he managed teams of scientists and engineers in New York City, Sunnyvale, Bangalore, and Barcelona. He was also a visiting professor at KAIST. He has published widely in top-tier conferences (KDD, WWW, RecSys, CVPR, ACM Multimedia, etc.) and is a frequent speaker at international academic and industry events. He holds a PhD from Columbia University.

Presentations

AI for Good at Scale in Real Time: Challenges in Machine Learning and Deep Learning Session

When emergency events occur, social signals and sensor data are generated. In this talk, I will describe how Machine Learning and Deep Learning are applied in processing large amounts of heterogeneous data from various sources in real time, with a particular focus on how such information can be used for emergencies and in critical events for first responders and for other social good use cases.

Dave Josephsen runs the telemetry engineering team at SparkPost. He thinks you’re pretty great.

Presentations

Schema On Read and the New Logging Way Session

This is the story of how SparkPost Reliability Engineering abandoned ELK for a DIY schema-on-read logging infrastructure. We share architectural details and tribulations from our _Internal Event Hose_ data ingestion pipeline project, which uses Fluentd, Kinesis, Parquet, and AWS Athena to make logging sane.

Yiannis Kanellopoulos is the founder of Code4Thought, a startup striving to promote algorithmic accountability and transparency in the machine learning world. He holds a PhD in data mining from the University of Manchester, UK. Yiannis has worked for more than 15 years on bringing transparency to the way software is developed from a technical point of view. He is also a founding member of Orange Grove Patras, a business incubator sponsored by the Dutch Embassy in Greece to promote entrepreneurship and counter youth unemployment.

Presentations

On the Accountability of Black Boxes: How we can control what we can’t exactly measure. Findata

Black box algorithmic models make decisions that have a great impact on our lives. Thus, the need for their accountability and transparency is growing. To address this, we have created an evaluation framework for models and the organisations utilising them. This session presents the aspects of our framework and the lessons learnt from its application at a multibillion-dollar high-tech corporation.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Autoscaling Spark on Kubernetes Session

In the Kubernetes world where declarative resources are a first class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice -- we can finally look at constructing a hybrid system of running Spark in a distributed cloud native way. Join respective experts Kris Nova & Holden Karau for a fun adventure.

Cross-Cloud Model Training and Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We’ll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.

Improving Spark Down Scaling: Or not throwing away all of our work Session

As more workloads move to “serverless”-like environments, the importance of properly handling downscaling increases.

Rohit Karlupia has been writing high-performance server applications ever since completing his Bachelor of Technology in Computer Science and Engineering at IIT Delhi in 2001. He has deep expertise in messaging, API gateways, and mobile applications. His primary research interests are the performance and scalability of cloud applications. At Qubole, his primary focus is making Big Data as a Service debuggable, scalable, and performant. His current work includes SparkLens (an open source Spark profiler), GC/CPU-aware task scheduling for Spark, and the Qubole Chunked Hadoop File System.

Presentations

Scalability-Aware Autoscaling of Spark Applications Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. In this talk we will discuss (1) measuring the efficiency of autoscaling policies and (2) designing more efficient autoscaling policies in terms of latency and costs.
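
The first theme, measuring the efficiency of an autoscaling policy, can be illustrated by replaying a demand trace under competing policies and comparing total cost against unmet demand (a crude latency proxy). This is a hypothetical toy: the trace, capacity, and both policies are invented, and real autoscalers must also account for scale-up delays and lost work on downscaling.

```python
trace = [2, 2, 10, 12, 12, 4, 2, 2]  # parallel tasks needed per minute
CAPACITY = 2                          # tasks one node serves per minute

def evaluate(policy, trace):
    """Replay the trace; return (node-minutes paid, tasks left waiting)."""
    cost = backlog = 0
    nodes = 1
    for demand in trace:
        nodes = policy(nodes, demand)
        cost += nodes
        backlog += max(0, demand - nodes * CAPACITY)
    return cost, backlog

static = lambda nodes, demand: 2                         # fixed fleet of 2
reactive = lambda nodes, demand: -(-demand // CAPACITY)  # ceil(demand/capacity)

print(evaluate(static, trace))    # cheap, but demand piles up at the peak
print(evaluate(reactive, trace))  # zero backlog, at a higher total cost
```

Comparing (cost, backlog) pairs like these over historical traces is one simple way to rank candidate policies before trusting them with production workloads.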

Until recently, Arun Kejariwal was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Architecture and Algorithms for End-to-End Streaming Data Processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). In this tutorial we shall lead the audience through a journey of the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline - messaging, compute and storage - for real-time data and algorithms to extract insights - e.g., heavy-hitters, quantiles - from data streams.

Model serving via Pulsar Functions Session

In this talk, we shall walk the audience through an architecture whereby models are served in real-time and the models are updated, using Apache Pulsar, without restarting the application at hand. Further, we will describe how Pulsar functions can be applied to support two example use cases, viz., sampling and filtering. We shall lead the audience through a concrete case study of the same.

Sequence-2-Sequence Modeling for Time Series Session

Recently, sequence-to-sequence (S2S) modeling has also been used for applications based on time series data. This talk first gives an overview of S2S and its early use cases, then walks through how S2S modeling can be leveraged for time series use cases such as real-time anomaly detection and forecasting.

Ganes is a co-founder of Gramener, where he heads analytics and innovation in data science.

Ganes advises enterprises on deriving value from data science initiatives, and leads applied research in deep learning at Gramener AI Labs. He is passionate about the confluence of machine learning, information design and data-driven business leadership.

He strives to simplify and demystify data science. More about his data pursuits and writing can be found at https://gkesari.com.

Presentations

AI for Social Good: Saving the planet through data science DCS

Global environmental challenges have pushed our planet to the brink of disaster. Rapid advances in deep learning are placing immense power in the hands of consumers and enterprises. This power can be marshaled to support environmental groups and researchers who need immediate assistance to address the rapid depletion of our rich biodiversity.

Jay is currently the head of the analytics practice at Bowery Analytics LLC. He works with clients to devise predictive analytics strategies for executive decision makers, advising companies on modeling techniques, data transformation and visualization needs, and the software and human resources needed to execute analytics projects.

He has spent the last 14 years working with Fortune 100 clients across industries, executing large-scale transformation projects in CRM, order management, pricing engines, customer management systems, and advanced marketing solutions.

He holds a BS in computer science from Andrews University, an MS in business analytics from NYU, an MS in technology management from Columbia University in New York, and a certificate in leadership from IE, Madrid.

Presentations

Evaluating cyber security defenses with a data science approach Session

Cyber security analysts are under siege to keep pace with the ever-changing threat landscape. The analysts are overworked, burned out, and bombarded by the sheer number of alerts they must carefully investigate. To empower our cyber security analysts, we can use a data science model for alert evaluations.

Seonmin Kim is a senior data risk analyst at LINE, where he is a key member of the Trust and Safety team that handles payment fraud and content abuse using data analytics.
He has over 9 years of extensive experience in identifying fraud and abuse risk across various business domains.
His primary focus is on AI and machine learning for payment fraud and abuse risk.

Presentations

How to mitigate mobile fraud risk by data analytics Session

Kim provides an introduction to activities that mitigate the risk of mobile payments using data analytics techniques drawn from actual mobile fraud case studies, along with tree-based machine learning, graph analytics, and statistical approaches.

Melinda King is a Google Authorized Trainer at ROI Training, 2017’s Google Cloud Training Partner of the Year. Melinda brings 30+ years of progressive experience with a unique combination of technical, managerial, and organizational skills. She has done solution design, development, and implementation using Google products including Compute Engine, App Engine, Kubernetes, Bigtable, Spanner, BigQuery, Pub/Sub, Dataflow, and Dataproc. Her expertise includes applying data science algorithms on big data to produce insights for optimizing business decisions. Melinda is also a Microsoft Certified Trainer with certifications for Azure, SQL Server, and Data Management and Analytics. Melinda spent 20+ years serving as a member of the US Marine Corps.

Presentations

Serverless Machine Learning with TensorFlow, Part I Tutorial

This tutorial provides an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and develop skills in building, evaluating, and productionizing ML models.

Serverless Machine Learning with TensorFlow, Part II Tutorial

This tutorial provides an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and develop skills in building, evaluating, and productionizing ML models.

Mikayla is a software engineer at Google on the Cloud Dataproc team. She helped launch Dataproc’s High Availability mode and the Workflow Templates API. She is currently working on improvements to shuffle and autoscaling.

Presentations

Improving Spark Down Scaling: Or not throwing away all of our work Session

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases.

Gabor Kotalik is a big data project lead at Deutsche Telekom, where he’s responsible for continuous improvement of customer analytics and machine learning solutions for commercial roaming business. He has more than 10 years of experience in business intelligence and advanced analytics focusing on utilization of insights and enabling data-driven business decisions.

Presentations

Data Science in Deutsche Telekom - Predicting global travel patterns and network demand Session

Knowledge of customers' locations and travel patterns is important for many companies, among them the German telco operator Deutsche Telekom. A commercial roaming project using Cloudera Hadoop helped the company analyze the behavior of its customers from 10 countries in a very secure way, enabling better predictions and visualizations for management.

Cassie Kozyrkov is Google Cloud’s chief decision scientist. Cassie is passionate about helping everyone make better decisions through harnessing the beauty and power of data. She speaks at conferences and meets with leadership teams to empower decision makers to transform their industries through AI, machine learning, and analytics. At Google, Cassie has advised more than a hundred teams on statistics and machine learning, working most closely with research and machine intelligence, Google Maps, and ads and commerce. She has also personally trained more than 15,000 Googlers (executives, engineers, scientists, and even nontechnical staff members) in machine learning, statistics, and data-driven decision making. Previously, Cassie spent a decade working as a data scientist and consultant. She is a leading expert in decision science, with undergraduate studies in statistics and economics at the University of Chicago and graduate studies in statistics, neuroscience, and psychology at Duke University and NCSU. When she’s not working, you’re most likely to find Cassie at the theatre, in an art museum, exploring the world, playing board games, or curled up with a good novel.

Presentations

Making Data Science Useful Keynote

Despite the rise of data engineering and data science functions in today's corporations, leaders report difficulty in extracting value from data. Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Let’s talk about how you can change that!

Mounia Lalmas is a director of research at Spotify and the head of tech research in personalization. Mounia also holds an honorary professorship at University College London. Before that, she was a director of research at Yahoo, where she led a team of researchers working on advertising quality for Gemini, Yahoo's native advertising platform. She also worked with various teams at Yahoo on topics related to user engagement in the context of news, search, and user-generated content. Her work focuses on studying user engagement in areas such as native advertising, digital media, social media, search, and now music. She has given numerous talks and tutorials on these and related topics. She is also the co-author of a book written as the outcome of her WWW 2013 tutorial on “measuring user engagement”.

Presentations

Recommending and Searching (Research @ Spotify) Session

Spotify's mission is "to match fans and artists in a personal and relevant way". In this talk, Mounia describes some of the research work being done to achieve this, from using machine learning to metric validation, including work done in the context of Home and Search.

Francesca Lazzeri is an AI and machine learning scientist on the cloud developer advocacy team at Microsoft. Francesca has multiple years of experience as data scientist and data-driven business strategy expert; she is passionate about innovations in big data technologies and the applications of machine learning-based solutions to real-world problems. Her work on these issues covers a wide range of industries, including energy, oil and gas, retail, aerospace, healthcare, and professional services. Previously, she was a research fellow in business economics at Harvard Business School, where she performed statistical and econometric analysis within the Technology and Operations Management Unit and worked on multiple patent data-driven projects to investigate and measure the impact of external knowledge networks on companies’ competitiveness and innovation. Francesca is a mentor for PhD and postdoc students at the Massachusetts Institute of Technology and enjoys speaking at academic and industry conferences to share her knowledge and passion for AI, machine learning, and coding. Francesca holds a PhD in innovation management.

Presentations

Cross-Cloud Model Training and Serving with Kubeflow Tutorial

This workshop quickly introduces what Kubeflow is and how it can be used to train and serve models across different cloud environments (and on-premises). We’ll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud, then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.

Time Series Forecasting with Azure Machine Learning service Tutorial

Time series modeling and forecasting are of fundamental importance to many practical domains, and during the past few decades, machine learning model-based forecasting has become very popular in both private and public decision-making processes. This tutorial walks you through the core steps for using Azure Machine Learning to build and deploy your time series forecasting models.

Randy Lea is chief revenue officer at Arcadia Data, where he is charged with leading the company’s sales momentum. Randy is passionate about solving customer problems by leveraging analytics and data. An early participant in the data warehouse and BI analytics market, he has held leadership positions at companies including Aster Data, Think Big Analytics, and Teradata. Randy holds a bachelor’s degree in marketing from California State University, Fullerton.

Presentations

Intelligent Design Patterns for Cloud-Based Analytics and BI (sponsored by Arcadia Data) Session

With cloud object storage (e.g., S3, ADLS), one expects business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces non-obvious challenges. This talk reviews service-oriented cloud design (storage, compute, catalog, security, SQL) and shows how native cloud BI provides analytic depth, low cost, and performance.

Sun works at Equinor as a leading engineer within Enterprise Data Management. She has 3 years of data management experience from the Norwegian Hydrographic Office and 7 years of drilling services experience prior to her data management positions at Statoil since 2008, including advisory positions since 2011 and membership of the Blue Book Work Group and the Diskos Well Committee. In Enterprise Data Management, the focus has now shifted to the whole company. Sun holds an MSc in petroleum geoscience from NTNU (1998).

Presentations

Architecting a data platform to support analytic workflows for scientific data Session

In Upstream Oil and Gas, a vast amount of the data requested for analytics projects is “scientific data” - physical measurements about the real world. Historically this data has been managed “library-style” in files - but to provide this data to analytics projects, we need to do something different. Sun and Jane discuss architectural best practices learned from their work with subsurface data.

Implementing Enterprise Data Management in Industrial and Scientific organisations Session

Implementing Enterprise Data Management is never easy, but it's even harder in industrial and scientific organisations. Three worlds of business data, facilities data and scientific data have long been managed separately but must be brought together to realise business value. Sun and Jane will address the cultural and organisational differences as well as data management requirements to succeed.

14 years of experience in IT across different sectors and technologies.
Strong expertise in defining and developing big data architectures for both batch and streaming processing.

Presentations

The vindication of Big data. How Hadoop is used in Santander UK to defend privacy. Session

Big data is usually regarded as a menace to data privacy. However, with the right principles and mindset, it can be a game changer: putting customers first and treating data privacy as an inalienable right. Santander UK applied this model to comply with GDPR, using graph technology, Hadoop, Spark, and Kudu to drive data obscuring, data portability, and machine learning exploration.

Brennan is a self-proclaimed data nerd. He has been working in the financial industry for the past 10 years and is striving to save the world with a little help from our machine friends.

He has held cyber security, data scientist, and leadership roles at JP Morgan Chase, the Federal Reserve Bank of New York, Bloomberg, and Goldman Sachs. Brennan holds a master's degree in Business Analytics from New York University and participates in the data science community through his non-profit pro bono work at DataKind and as a co-organizer of the NYU Data Science and Analytics Meetup.

Brennan is also an instructor at the New York Data Science Academy and teaches data science courses in R and Python.

Presentations

Evaluating cyber security defenses with a data science approach Session

Cyber security analysts are under siege to keep pace with the ever-changing threat landscape. The analysts are overworked, burned out, and bombarded by the sheer number of alerts they must carefully investigate. To empower our cyber security analysts, we can use a data science model for alert evaluations.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience. He enjoys intelligent design and engaging storytelling and is passionate about data, music, and nature.

Presentations

Building a Serverless Big Data Application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this workshop, we show you how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You will build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Building a Serverless Big Data Application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this workshop, we show you how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You will build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

David Low is currently the co-founder and chief data scientist at Pand.ai, building an AI-powered chatbot to disrupt and shape the booming conversational commerce space with deep natural language processing. He represented Singapore and the National University of Singapore (NUS) in Data Science Game '16 in France and clinched the top spot among teams from Asia and America. Recently, David has been invited as a guest lecturer by NUS to conduct masterclasses on applied machine learning and deep learning topics. Prior to Pand.ai, he was a data scientist with the Infocomm Development Authority (IDA) of Singapore.

Throughout his career, David has engaged in data science projects across the manufacturing, telco, e-commerce, and insurance industries. Some of his work, including sales forecast modeling and influencer detection, won him awards in several competitions and was featured on the IDA website and in an NUS publication. Earlier in his career, David was involved in research collaborations with Carnegie Mellon University (CMU) and the Massachusetts Institute of Technology (MIT) on separate projects funded by the National Research Foundation and SMART. As a pastime, he competed on Kaggle and achieved a top 0.2% worldwide ranking.

Presentations

The Unreasonable Effectiveness of Transfer Learning on NLP Session

Transfer learning has proven a tremendous success in the computer vision field as a result of the ImageNet competition. In the past months, the natural language processing field has witnessed several breakthroughs with transfer learning, namely ELMo, the OpenAI Transformer, and ULMFiT. In this talk, David showcases the use of transfer learning in NLP applications with state-of-the-art accuracy.

Feng Lu is a software engineer at Google and the tech lead and manager for Cloud Composer. He joined Google in 2014 after completing his PhD at UC San Diego, where his research was covered by MIT Technology Review, among others. He has a broad interest in cloud and big data analytics.

Presentations

Migrating Apache Oozie Workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. We see a need for Oozie-to-Airflow workflow mapping as part of creating an effective cross-cloud/cross-system solution. This talk introduces an open-source Oozie-to-Airflow migration tool developed at Google.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Over his career, he has been responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on Machine Learning with Kafka-based Streaming Pipelines Tutorial

This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and Spark ML, performance considerations, model metadata tracking, and other techniques.
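The "retrain periodically, score with low latency" pattern this tutorial describes rests on one small idea: the scoring path holds a swappable reference to the current model, so retraining can publish a new model without restarting the stream. The sketch below is a minimal, framework-free illustration of that idea; the class and the toy models are assumptions for illustration, not the tutorial's actual code.

```python
import threading

class ModelServer:
    """Serve scores from a model that can be hot-swapped at runtime,
    without restarting the scoring loop."""
    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def update(self, new_model):
        # Atomically replace the model between scoring calls.
        with self._lock:
            self._model = new_model

    def score(self, features):
        with self._lock:
            return self._model(features)

server = ModelServer(lambda x: 2 * x)       # v1 model
v1 = server.score(10)                        # scored by v1
server.update(lambda x: 2 * x + 1)           # hot-swap to retrained v2
v2 = server.score(10)                        # scored by v2, no restart
```

In a Kafka-based pipeline, `update` would typically be driven by a control topic carrying new model artifacts or metadata, while `score` runs in the hot path of the record-processing loop.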

Angie Ma is cofounder and COO of ASI Data Science, a London-based AI tech startup that offers data science as a service, which has completed more than 120 commercial data science projects in multiple industries and sectors and is regarded as the EMEA-based leader in data science. Angie is passionate about real-world applications of machine learning that generate business value for companies and organizations and has experience delivering complex projects from prototyping to implementation. A physicist by training, Angie was previously a researcher in nanotechnology working on developing optical detection for medical diagnostics.

Presentations

AI for managers 2-Day Training

Angie Ma and Jonny Howell offer a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

AI for managers (Day 2) Training Day 2

Angie Ma offers a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Swetha Machanavajhala is a software engineer for Azure Networking at Microsoft, where she builds tools to help engineers detect and diagnose network issues within seconds. She is very passionate about building products and awareness for people with disabilities and has led several related projects at hackathons, driving them from idea to reality to launching as a beta product and winning multiple awards. Swetha is a co-lead of the Disability Employee Resource Group, where she represents the community of people who are deaf or hard of hearing, and is a part of the ERG chair committee. She is also a frequent speaker at both internal and external events.

Presentations

Inclusive Design: Deep Learning on audio in Azure, identifying sounds in real-time. Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in this world who are deaf or hard of hearing. We will explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.

Mark Madsen is the global head of architecture at Think Big Analytics, where he is responsible for understanding, forecasting, and defining the analytics landscape and architecture. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning and vendors on product management. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Romi Mahajan is CEO of KKM Group, an advisory and investment firm with interests in over 35 technology-based companies. He is a marketer, author, activist, and philanthropist based in Bellevue, WA, USA.

Romi spent a decade at Microsoft and has been chief marketing officer of 5 companies.

Presentations

Real Estate, Real AI: Insights and Decisions in the World's Largest Asset Class Findata

Residential Real Estate is the world's largest asset class. More importantly, "dwellings" constitute the single largest purchase for most families around the globe. Still, in the world's largest residential real estate markets, the process of valuing, buying, and selling houses is byzantine, analog, and mysterious. Using sophisticated and real-world AI is the key to democratizing value.

I am a data architect and data scientist with 13+ years of experience building extremely large data warehouses and analytical solutions. I have worked extensively on Hadoop, DI and BI tools, data mining and forecasting, data modeling, master and metadata management, and dashboard tools. I am proficient in Hadoop, SAS, R, Informatica, Teradata, and QlikView. I participate in Kaggle data mining competitions as a hobby.

Presentations

Scaling Impala - Common Mistakes and Best Practices Session

Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala is able to handle hundreds of nodes and tens of thousands of queries hourly. In this talk, we will discuss how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and antipatterns for end users and BI applications.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Foundations for Successful Data Projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. This presentation provides guidance and best practices, from planning to implementation, based on years of experience working with companies to deliver successful data projects.

Mastering Streams & Pipelines: Designing and supporting the nervous system of your company Session

In the world of data, it is all about building the best path to support time-to-value and quality-of-value. 80% to 90% of the work is getting the data into the hands and tools that can create value. This talk takes us on a journey through different patterns and solutions that can work at the largest of companies.

Sundeep is SVP of product development at Gramener, a data science company, where he leads a team of data enthusiasts who tell visual stories of insights from analysis, built on Gramex, Gramener's data-science-in-a-box platform. Previously, Sundeep worked at Comcast Cable, NeoTech Solutions, Birlasoft Inc, and Wipro Technologies, and consulted for federal agencies in the USA and India. He holds a bachelor's in electrical engineering and an MBA in IT and marketing.

Presentations

India's Data Dilemma with India Stack Session

Answering the simple question of what rights Indian citizens have over their data is a nightmare, and the rollout of India Stack-based solutions has added fuel to the fire. Sundeep explains, with on-the-ground examples, how businesses and citizens are navigating the India Stack ecosystem while dealing with data privacy, security, and ethics in India's booming digital economy.

James Malone is a product manager for Google Cloud Platform and manages Cloud Dataproc and Apache Beam (incubating). Previously, James worked at Disney and Amazon. James is a big fan of open source software because it shows what is possible when people come together to solve common problems with technology. He also loves data, amateur radio, Disneyland, photography, running, and Legos.

Presentations

Migrating Apache Oozie Workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. We see a need for Oozie-to-Airflow workflow mapping as part of creating an effective cross-cloud/cross-system solution. This talk introduces an open-source Oozie-to-Airflow migration tool developed at Google.

Founder, CEO and CTO.

David is a serial entrepreneur; his most recent company, Hexatier (formerly GreenSQL), was acquired by Huawei. He was a founder of Precos, Vanadium-soft, GreenCloud, Teridion, Terrasic, and Re-Sec, among others.

Previously a director in Fortinet’s CTO office, he managed information security at Bezeq, the Israeli Telecom.

He has 24 years' experience in leadership, AI, cyber security, development, and networking, and is a veteran of an elite IDF unit.

He was named one of the top 40 Israeli internet startup professionals by TheMarker magazine and one of the top 40 under 40 most promising Israeli business professionals by Globes magazine.

David holds a master’s in computer science from Open University.

Presentations

Signal Processing, Machine Learning & Video Tell the Truth Session

The combination of a mere few minutes of video, signal processing, remote heart rate monitoring, machine learning, and data science can identify a person's emotions, health condition, and performance. Financial institutions and potential employers can analyze whether you have good or bad intentions.

Shingai Manjengwa is the chief executive officer at Fireside Analytics Inc., a Canadian ed-tech start-up that develops customized data science training programs for corporations, governments, and educational institutions. Data science courses by Fireside Analytics have over 300,000 registered learners on platforms like IBM's CognitiveClass.ai and Coursera. Shingai is an instructor in business analytics and distributed file systems in the Data Science Advanced Diploma program at the Metro College of Technology in Toronto.

An IBM Influencer, author, and NYU Stern alum, Shingai is also the founder of Fireside Analytics Academy, a registered private high school (BSID: 886528) that teaches high school students to solve problems with data. The IDC4U Data Science academic credit is inspected by the Ministry of Education in Canada and teaches data science concepts through youth-focused case studies that combine business studies, computer programming, and statistics. The program is available to international students online, and the curriculum can be licensed to schools; it currently runs in 2 private high schools in Canada.

Presentations

Building Data Science Capacity in your Organization Keynote

Insights from teaching data science to 300,000 online learners, second-career college graduates, and Grade 12 / 6th Form high school students.

Cecilia is a manager at Jakala with 8+ years of experience in consulting and MarTech.

She helps retail and FMCG companies create a sustainable competitive advantage and increase their top line by leveraging data, advanced analytics and AI, location analytics, technology, and experience design.

Presentations

Data-intense profiling of points of consumption to increase sales&marketing effectiveness DCS

A major beverage company faced a significant challenge in defining its sales and marketing strategy in France, given the large number of points of consumption (POC) and the minimal data available to differentiate them. This session shows how Jakala mixed social media data and location data from mobile apps to estimate both the overall attractiveness of a POC and the affinity of its real consumers to the brand target...

Jane McConnell is a practice partner for oil and gas within Teradata’s Industrial IoT Group, where she shows oil and gas clients how analytics can provide strategic advantage and business benefits in the multimillions. Jane is also a member of Teradata’s IoT core team, where she sets the strategy and positioning for Teradata’s IoT offerings and works closely with Teradata Labs to influence development of products and services for the industrial space. Originally from an IT background, Jane has also done time with dominant oil industry market players such as Landmark and Schlumberger in R&D, product management, consulting, and sales. In one role or another, she has influenced information management projects for most major oil companies across Europe. She chaired the education committee for the European oil industry data management group ECIM, has written for Forbes, and regularly presents internationally at oil industry events. Jane holds a BEng in information systems engineering from Heriot-Watt University in the UK. She is Scottish and has a stereotypical love of single malt whisky.

Presentations

Architecting a data platform to support analytic workflows for scientific data Session

In upstream oil and gas, a vast amount of the data requested for analytics projects is “scientific data”: physical measurements about the real world. Historically, this data has been managed “library-style” in files, but to provide it to analytics projects, we need to do something different. Sun and Jane discuss architectural best practices learned from their work with subsurface data.

Implementing Enterprise Data Management in Industrial and Scientific organisations Session

Implementing enterprise data management is never easy, but it's even harder in industrial and scientific organisations. The three worlds of business data, facilities data, and scientific data have long been managed separately but must be brought together to realise business value. Sun and Jane will address the cultural and organisational differences, as well as the data management requirements, needed to succeed.

Darragh is a Solution Architect at Kainos, specialising in data engineering. He has been working with data-intensive systems for over a decade and was the founder of Kainos’ Data & Analytics Capability in 2014. He is Kainos’ lead architect for NewDay’s AWS Data Platform. He enjoys working with talented people and, like every engineer, loves a technical challenge. In his spare time he is usually up a mountain or on a squash court, but lately he has developed an unhealthy fascination with unsolved crimes.

Presentations

Cloud-based streams and batches in the PCI - Transforming a Financial Services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up, on AWS. Session

In this session you will learn how we built a high-performance, contemporary data processing platform from the ground up on AWS. We will discuss our journey from a legacy, onsite, traditional data estate to an entirely cloud-based, PCI DSS-compliant platform.

Michael McCune is a software developer in Red Hat’s Emerging Technology Group, where he develops and deploys applications for cloud platforms. He is an active contributor to several radanalytics.io projects and is a core reviewer for the OpenStack API working group. Previously, Michael developed Linux-based software for embedded global positioning systems.

Presentations

Application intelligence: bridging the gap between human expertise and machine learning Session

Artificial intelligence and machine learning are now popularly used terms, but how do we make use of these techniques without throwing away the valuable knowledge of experienced employees? This session will delve into this idea with examples of how distributed machine learning frameworks fit together naturally with business rules management systems.

Hussein joined Google in October 2017 to relaunch the Cloud AI platform products, which include Cloud ML Engine, Kubeflow, and more to come. Prior to Google, Hussein worked at Facebook from 2012 to 2017, where he founded Facebook’s AI platform and Applied ML teams, which built critical AI solutions and systems for News Feed, Ads, Instagram, WhatsApp, Messenger, and many other Facebook products.

Prior to Facebook, Hussein worked on search and speech at Bing, Microsoft, and received a master’s degree in speech recognition from the University of Cambridge.

Presentations

Mass production of AI solutions Session

AI will change how we live in the next 30 years. However, AI is still limited to a small group of companies. Building AI systems is expensive and difficult, but in order to scale the impact of AI across the globe, we need to reduce the cost of building AI solutions. How can we do that? Can we learn from other industries? Yes, we can. The automobile industry went through a similar cycle.

I am a Big Data Engineer at the Nielsen Marketing Cloud. I specialize in research and development of solutions for big data infrastructures using cutting-edge technologies such as Spark, Kafka, and Elasticsearch.

Presentations

Nielsen Presents: Fun with Kafka, Spark and Offset Management Session

Ingesting billions of events per day into our big data stores, we need to do it in a scalable, cost-efficient, and consistent way. When working with Spark and Kafka, the way you manage your consumer offsets has major implications for data consistency. We will go in depth on the solution we ended up implementing and discuss the working process and the dos and don'ts that led us to its final design.

Cameron is a senior computer science student at Truman State University in Missouri. He is currently a research intern at Google on the Cloud Composer team and has had two internships there previously. He has a passion for open source projects, with a recent interest in Apache Airflow and Apache Oozie.

Presentations

Migrating Apache Oozie Workflows to Apache Airflow Session

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems; the former focuses on Apache Hadoop jobs. We see a need to build Oozie-to-Airflow workflow mapping as part of creating an effective cross-cloud/cross-system solution. This talk introduces an open-source Oozie-to-Airflow migration tool developed at Google.

Robin is a Developer Advocate at Confluent, the company founded by the creators of Apache Kafka, as well as an Oracle Developer Champion and ACE Director Alumnus. His career has always involved data, from the old worlds of COBOL and DB2, through the worlds of Oracle and Hadoop, and into the current world with Kafka. His particular interests are analytics, systems architecture, performance testing and optimization. He blogs at http://cnfl.io/rmoff and http://rmoff.net/ (and previously http://ritt.md/rmoff) and can be found tweeting grumpy geek thoughts as @rmoff. Outside of work he enjoys drinking good beer and eating fried breakfasts, although generally not at the same time.

Presentations

Real-time SQL Stream Processing at Scale with Apache Kafka and KSQL Tutorial

In this workshop you will learn the architectural reasoning for Apache Kafka and the benefits of real-time integration, and then build a streaming data pipeline using nothing but your bare hands, Kafka Connect, and KSQL.

The Changing Face of ETL: Event-Driven Architectures for Data Engineers Session

This talk discusses the concepts of events, their relevance to software and data engineers and their ability to unify architectures in a powerful way. It describes why analytics, data integration and ETL fit naturally into a streaming world. There'll be a hands-on demonstration of these concepts in practice and commentary on the design choices made.

Ines is a developer specialising in applications for AI technology. She’s the co-founder of Explosion AI and a core developer of spaCy, one of the most popular libraries for Natural Language Processing, and Prodigy, an annotation tool for radically efficient machine teaching.

Presentations

Practical NLP transfer learning with spaCy and Prodigy Scale Session

In this talk, I'll explain spaCy's new support for efficient and easy transfer learning, and show you how it can kickstart new NLP projects with our new annotation tool, Prodigy Scale.

I am a passionate ex-data scientist who moved over to the business side a few years back due to a lack of stakeholders who understood the importance of data in a data-driven business. Since then I have created new data-driven offerings, acting as the lead architect behind Sweden’s strategic innovation program for transportation, www.drivesweden.net/en. Drive Sweden consists of more than 90 global partners in the area of transportation, with the purpose of setting a new de facto standard way of working and a digital infrastructure worthy of the fourth industrial revolution.

Presentations

The digital truth and the physical twin DCS

This is a practical presentation of how the fourth industrial revolution is transforming companies and business models as we know them. The truth is no longer what you see with your eyes; the truth is in the digital sphere, where a physical twin is only sometimes needed. What is the need for a road sign along the street if the information is already in the car?

I have worked for the past six years as an engineer on various Adobe Marketing Cloud solutions, where I got to experiment with mobile, video, and backend development.
When dealing with Adobe applications serving 23 billion requests per day, some serious muscles need to be flexed. To make it possible to deploy new versions of our backend applications while the plane is flying, we need extremely precise and reliable tools that work fast and with minimal human intervention. This is the area my team focuses on, offering infrastructure automation and fast deployments in Adobe Audience Manager.
Outside business hours, I love playing pool and enjoying a good book.

Presentations

Deploying your realtime apps on thousands of servers and still being able to breathe Session

Obtaining servers to run your realtime application has never been easier: cloud providers have removed the cumbersome process of provisioning new hardware to suit your needs. What happens, though, when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast and reliable way with minimal human intervention? This session addresses this precise topic.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Previously, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive.

Presentations

Running SQL-based workloads in the cloud at 20x-200x Lower Cost Using Apache Arrow Session

Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. We look at TPC workloads and how they can be accelerated, invisible to client apps. We explore how Apache Arrow, Parquet, and Calcite can be used to provide a scalable, high-performance solution optimized for cloud deployments, while significantly reducing operational costs.

Paco Nathan is known as a “player/coach”, with core expertise in data science, natural language processing, machine learning, cloud computing; 35+ years tech industry experience, ranging from Bell Labs to early-stage start-ups. Co-chair JupyterCon and Rev. Advisor for Amplify Partners, Deep Learning Analytics, Recognai, Data Spartan. Recent roles: Director, Learning Group @ O’Reilly Media; Director, Community Evangelism @ Databricks and Apache Spark. Cited in 2015 as one of the Top 30 People in Big Data and Analytics by Innovation Enterprise.

Presentations

Executive Briefing: Overview of Data Governance Session

Data governance is an almost overwhelming topic. This talk surveys its history and themes, along with tools, processes, standards, and more. Mistakes lead to data quality issues, lack of availability, and other risks that prevent organizations from leveraging data; compliance efforts, on the other hand, aim to prevent the risks of leveraging data inappropriately. Ultimately, risk management plays the "thin edge of the wedge" in the enterprise.

Dr Sami Niemi has been working on Bayesian inference and machine learning for over 10 years and has published peer-reviewed papers in astrophysics and statistics. He has delivered machine learning models for industries such as telecommunications and financial services. Sami has built supervised learning models to predict customer and company defaults, first- and third-party fraud, and customer complaints, and has used natural language processing for probabilistic parsing and matching. He has also used unsupervised learning in a risk-based anti-money laundering application. Currently Sami works at Barclays, where he leads a team of data scientists building fraud detection models and manages the UK fraud models.

Presentations

Predicting Real-Time Transaction Fraud Using Supervised Learning Session

Predicting transaction fraud of debit and credit card payments in real time is an important challenge, which state-of-the-art supervised machine learning models can help to solve. Barclays has been developing and testing different solutions and will show how well different models perform in a variety of situations, such as card-present and card-not-present debit and credit card transactions.

Kris Nova is a senior developer advocate at Heptio focusing on containers, infrastructure, and Kubernetes. She is also an ambassador for the Cloud Native Computing Foundation. Previously, Kris was a developer advocate and an engineer on Kubernetes in Azure at Microsoft. She has a deep technical background in the Go programming language and has authored many successful tools in Go. Kris is a Kubernetes maintainer and the creator of kubicorn, a successful Kubernetes infrastructure management tool. She organizes a special interest group in Kubernetes and is a leader in the community. Kris understands the grievances with running cloud-native infrastructure via a distributed cloud-native application and recently authored an O’Reilly book on the topic: Cloud Native Infrastructure. Kris lives in Seattle, WA, and spends her free time mountaineering.

Presentations

Autoscaling Spark on Kubernetes Session

In the Kubernetes world, where declarative resources are a first-class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice -- we can finally look at constructing a hybrid system for running Spark in a distributed, cloud-native way. Join experts Kris Nova and Holden Karau for a fun adventure.

Eoin is currently a Lead Data Engineer at NewDay. For the past couple of years he has worked as part of NewDay’s digital transformation, specifically in bringing in and enabling new data capabilities. He previously worked at data analytics firm Dunnhumby, where he held several roles across data, IT, and architecture.

Presentations

Cloud-based streams and batches in the PCI - Transforming a Financial Services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up, on AWS. Session

In this session you will learn how we built a high-performance, contemporary data processing platform from the ground up on AWS. We will discuss our journey from a legacy, onsite, traditional data estate to an entirely cloud-based, PCI DSS-compliant platform.

Brian O’Neill is the founder and consulting product designer at Designing for Analytics, where he focuses on helping companies design indispensable data products that customers love. Brian’s clients and past employers include Dell EMC, NetApp, TripAdvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JPMorgan Chase, the Future of Music Coalition, and E*TRADE, among others; over his career, he has worked on award-winning storage industry software for Akorri and Infinio. Brian has been designing useful, usable, and beautiful products for the web since 1996. Brian has also brought over 20 years of design experience to various podcasts, meetups, and conferences such as the O’Reilly Strata Conference in New York City and London, England. He is the author of the Designing for Analytics Self-Assessment Guide for Non-Designers as well as numerous articles on design strategy, user experience, and business related to analytics. Brian is also an expert advisor on the topics of design and user experience for the International Institute for Analytics. When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage performing as a professional percussionist and drummer. He leads the acclaimed dual-ensemble Mr. Ho’s Orchestrotica, which the Washington Post called “anything but straightforward,” and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival. If you’re at a conference, just look for the only guy with a stylish orange leather messenger bag.

Presentations

Empathy: The Secret Ingredient in the Design of Engaging Data Products and Analytics Tools Session

In 2019, Gartner predicted 80%+ of analytics insights won’t deliver outcomes through 2022—despite sizable tech investments. While ML, AI, and advanced analytics remain in the "hype" cycle, data teams still struggle to design engaging, valuable decision support tools that customers love. Why? Solutions are too often data-first and human-second, obfuscating the real problems begging to be solved.

Cait O’Riordan is the FT’s Chief Product and Information Officer (CPIO). She is responsible for platform and product strategy, development and operations across the FT Group, working in close partnership with editorial and commercial teams. She is on the FT executive board, which is responsible for the company’s global strategy and performance. Before joining the FT in February 2016 Cait led the BBC’s digital product development for the London 2012 Olympics and played a central role in the user and revenue growth of music app company Shazam.

Presentations

Keynote with Cait O'Riordan Keynote

Chief Product and Information Officer at the Financial Times

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: From the Edge to AI - Taking Control of your Data for Fun and Profit Session

It's easier than ever to collect data, but managing it securely and in compliance with regulations and legal constraints is harder. There are plenty of tools that promise to bring machine learning techniques to your data, but choosing the right tools and managing models and applications in compliance with regulation and law is quite difficult.

Jerry Overton is a Data Scientist and Fellow in DXC’s Analytics group. He is the global lead for Artificial Intelligence at DXC.

Jerry is the author of the O’Reilly Media eBook Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist. He teaches the Safari live online training course Mastering Data Science at Enterprise Scale: How to design and implement machine-learning solutions that improve your organization. In his blog, Doing Data Science, Jerry shares his experiences leading open research and transforming organizations using data science.

Presentations

How to Keep Ethical with Machine Learning Session

Machine-learning algorithms are good at learning new behaviors but bad at identifying when those behaviors are harmful or don’t make sense. Bias, ethics, and fairness are big risk factors in machine learning (ML). We have a lot of experience dealing with intelligent beings—one another. In this talk, we use this common sense to build a checklist for protecting against ethical violations with ML.

Laila is currently a lawyer practicing technology and privacy law at GTC Law Professional Corp. She is also a software applications engineer. She previously held positions at ExxonMobil and Capstone Technology where she designed and implemented machine learning (AI) software solutions to optimize industrial processes. She routinely advises both Fortune 100 and start-up clients on all aspects of the development and commercialization of their technology solutions (including big data/predictive modelling/machine learning) in diverse industries including fintech, healthcare, and the automotive industry. She is a steering committee member of the Toronto Machine Learning Symposium and will be a panel member discussing responsible AI innovation in November. She has spoken most recently at the Global Blockchain Conference (“Smart Contract Management & Innovation”), the Healthcare Blockchain in Canada conference (“How Blockchain Can Solve Healthcare Challenges”) and the Linux FinTech Forum (“Smart Money Bets on Open Source Adoption in AI/ML Fintech Applications”). Laila will be faculty for the upcoming Osgoode Certificate in Blockchains, Smart Contracts and the Law (November 2018). Laila holds a B.A.Sc. in Chemical Engineering from the University of Toronto, a M.A.Sc. in Chemical Engineering from the University of Waterloo and a J.D. from the University of Toronto, where she was a law review editor. She is admitted to practice in New York and Ontario. She is also a Certified Information Privacy Professional (Canada) (CIPP/C).

Presentations

Responsible AI Innovation Session

As companies commercialize novel applications of AI in areas such as finance, hiring, and public policy, there is concern that these automated decision-making systems may unconsciously duplicate social biases, with unintended societal consequences. This talk will provide practical advice for companies to counteract such prejudices through a legal and ethics based approach to innovation.

Yves Peirsman is the founder and Natural Language Processing expert at NLP Town. Yves started his career as a PhD student at the University of Leuven and a post-doctoral researcher at Stanford University. Since he made the move from academia to industry, he has gained extensive experience in consultancy and software development for NLP projects in Belgium and abroad.

Presentations

Dealing with Data Scarcity in Natural Language Processing Session

In this age of big data, NLP professionals are all too often faced with a lack of data: written language is abundant, but labelled texts are much harder to come by. In my talk, I will discuss the most effective ways of addressing this challenge: from the semi-automatic construction of labelled training data to transfer learning approaches that reduce the need for labelled training examples.

Nick Pentreath is a principal engineer in IBM’s Center for Open-source Data & AI Technology (CODAIT), where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Building a Secure and Transparent ML Pipeline Using Open Source Technologies Session

The application of AI algorithms in domains such as criminal justice, credit scoring, and hiring holds unlimited promise. At the same time, it raises legitimate concerns about algorithmic fairness. There is a growing demand for fairness, accountability, and transparency from machine learning (ML) systems. In this talk we cover how to build just such a pipeline leveraging open source tools.

Dirk is Head of Engineering and Data Science at Zalando, Europe’s leading fashion platform. Trained as a data scientist, he enables his five development teams to revolutionize online marketing steering in a fully automated, ROI-driven, personalized way. In his spare time Dirk hacks functional Scala and reads through O’Reilly’s online library, 10 books at a time.

Presentations

Insights from Engineering Europe's Largest Marketing Platform for Fashion Session

A case study from Zalando, Europe’s leading online fashion platform, about its journey to a scalable, personalized, machine learning-based marketing platform.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, Tom was an early employee at VMware; as senior staff engineer, he was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular pluggable storage architecture. He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit filesystem.

Presentations

Deep learning with TensorFlow and Spark using GPUs and Docker containers Session

Organizations need to keep ahead of their competition by using the latest AI/ML/DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. This session will discuss the effective deployment of such applications in a container environment.

Willem leads the Data Science Platform Team at GO-JEK. His main focus areas are building data and ML platforms, allowing organizations to scale machine learning and drive decision making.

The GO-JEK ML platform supports a wide variety of models and handles over 100 million orders every month. Models include recommendation systems, driver allocation, forecasting, anomaly detection, route selection, and more.

In a previous life Willem founded and sold a networking startup and worked as a software engineer in industrial control systems.

Presentations

Unlocking insights in AI by building a feature store Session

Features are key to driving impact with AI at all scales. By democratizing the creation, discovery, and access of features through a unified platform, organizations are able to dramatically accelerate innovation and time to market. Find out how GOJEK, Indonesia's first billion-dollar startup, unlocked insights in AI by building a feature store called Feast, and the lessons they learned along the way.

Dan is a Site Reliability Engineer on the Adobe Audience Manager team, lately focused on creating and deploying continuous delivery pipelines for applications within the project, dealing with all aspects of the automation process from instance provisioning to application deployments. He is passionate about technology and, recently, about programming in general. He also loves playing video games.

Presentations

Deploying your realtime apps on thousands of servers and still being able to breathe Session

Obtaining servers to run your realtime application has never been easier: cloud providers have removed the cumbersome process of provisioning new hardware to suit your needs. What happens, though, when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast and reliable way with minimal human intervention? This session addresses this precise topic.

Greg is responsible for driving SQL product strategy as part of Cloudera’s data warehouse product team, including working directly with Impala. Over the past 20 years, Greg has worked with relational database systems across a variety of roles, including software engineering, database administration, database performance engineering, and most recently product management, giving him a holistic view of and expertise in the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

The Future of Cloud-native Data Warehousing: Emerging Trends and Technologies Session

Data warehouses have traditionally run in the data center and in recent years they have adapted to be more cloud-native. In this talk, we'll discuss a number of emerging trends and technologies that will impact how data warehouses are run both in the cloud and on-prem and share our vision on what that means for architects, administrators, and end users.

Vidya leads product management for machine learning at Cloudera. Prior to Cloudera, she helped build highly successful software portfolios in several industry verticals, including telecom, healthcare, energy, and IoT. Her experience spans early-stage startups, pre-IPO companies, and big enterprises. Vidya has a Master of Business Administration from Duke University.

Presentations

Starting with the end in mind: learnings from data strategies that work Session

Not surprisingly, there is no single approach to embracing data-driven innovation within any industry vertical. However, some enterprises are doing a better job than others when it comes to establishing a culture, process, and infrastructure that lend themselves to data-driven innovation. In this talk, we will share some key foundational ingredients that span multiple industries.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Architecture and Algorithms for End-to-End Streaming Data Processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). In this tutorial we lead the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline - messaging, compute, and storage - for real-time data, along with algorithms to extract insights, e.g., heavy hitters and quantiles, from data streams.

Model serving via Pulsar Functions Session

In this talk, we walk the audience through an architecture in which models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. We then describe how Pulsar Functions can be applied to support two example use cases, sampling and filtering, and conclude with a concrete case study.

Marc is responsible for leading the research and development of Automatic Data Processing's (ADP's) analytics and big data initiative. In this capacity, Marc drives the innovation and thought leadership behind ADP's client analytics platform. ADP Analytics gives clients not only the ability to read the pulse of their own human capital but also information on how they stack up within their industry, along with the best courses of action to achieve their goals through quantifiable insights. Marc was also an instrumental leader behind the small business market payroll platform, RUN Powered by ADP®, where he led a number of the technology teams responsible for delivering its critically acclaimed product, focused on an innovative user experience for small business owners.
Prior to joining ADP, Marc's innovative spirit and fascination with data were forged at Bolt Media, a dot-com startup based in NY's "Silicon Alley" and an early predecessor of today's social media outlets. As an early data scientist, Marc focused on the patterns and predictions of site usage by harnessing the data in its 10+ million user profiles.

Presentations

The Power of Merging Multi-Functional Expertise to Create Innovative, Data-Driven Products DCS

During this session, Marc will share his experience creating a cross-functional team, discuss the power of listening to others' points of view and what you can learn from them, and provide real-world case studies of leaders with varying backgrounds and perspectives who collaborated to take data from analysis to idea to product rollout.

Duncan Ross is Chief Data Officer at Times Higher Education. Duncan has been a data miner since the mid-1990s. Previously at Teradata, Duncan created analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing and social network analysis in telecommunications. In his spare time, Duncan has been a city councillor, chair of a national charity, founder of an award-winning farmers’ market, and one of the founding directors of the Institute of Data Miners. More recently, he cofounded DataKind UK and regularly speaks on data science and social good.

Presentations

Using data for evil V: the AI strikes back Session

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis (and possibly AI). Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

Why is it so hard to do AI for Good? Session

DataKind UK has been working in data for good since 2013, partnering with over 100 UK charities to help them do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. In this session, Duncan and Giselle will talk about how to identify the right data-for-good projects.

Nikki Rouda has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Before his current role at Amazon Web Services (AWS), Nikki held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki has an MBA from Cambridge's Judge Business School and a ScB in geophysics from Brown University.

Presentations

Executive Briefing: AWS Technology Trends - Data Lakes and Analytics Session

This talk covers some of the key trends we see in data lakes and analytics and how they shape the services we offer at AWS. Specific topics include the rise of machine-generated data and semi-structured/unstructured data as dominant sources of new data, the move toward serverless, API-centric computing, and the growing need for local access to data from users around the world.

Executive briefing: big data in the era of heavy worldwide privacy regulations Session

The General Data Protection Regulation (GDPR) went into effect on May 25, 2018, for all organizations—both EU and non-EU—that offer services to EU residents, as well as anyone who controls or processes data within the EU. The State of California is following suit with its California Consumer Privacy Act (CCPA), targeted to be enforced in 2020.

S.P.T. Krishnan, PhD, is a computer scientist and engineer with 18+ years of professional research and development experience in cloud computing, big data analytics, machine learning, and computer security.

He is recognized as a Google Developer Expert in Google Cloud Platform and an authorized trainer for Google Cloud Platform. Red Hat selected him as "Red Hat Certified Engineer of the Year." He has architect and developer experience on Amazon Web Services, Google Cloud Platform, OpenStack, and the Microsoft Azure platform.

He authored the book Building Your Next Big Thing with Google Cloud Platform and has spoken at both Black Hat and RSA. He is also an adjunct faculty member in computer science and has taught 500+ university students over 5 years.

He is also a cofounder of Google Developer Group Singapore. He holds a PhD in computer engineering from the National University of Singapore, where he studied the performance characteristics of high-performance computing algorithms by evaluating them on different multiprocessor architectures.

Presentations

Using AWS Serverless Technologies to Analyze Large Datasets Tutorial

This tutorial provides an overview of the latest big data and machine learning serverless technologies from AWS and a deep dive into using them to process and analyze two different datasets. The first is publicly available Bureau of Labor Statistics data; the second is a chest X-ray image dataset.

Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. Previously, he worked at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka.

Presentations

How do you evolve your data infrastructure? Session

Developing data infrastructure is not trivial, and neither is changing it. It takes effort and discipline to make changes that can affect your team. In this talk, we'll cover what we on Stitch Fix's Data Platform team do to maintain and innovate our infrastructure for our data scientists.

Shioulin Sam is a research engineer at Cloudera Fast Forward Labs. In her previous life, she was an angel investor focusing on women-led startups. She also worked in the investment management industry designing quantitative trading strategies. She holds a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Presentations

Learning with Limited Labeled Data Session

Supervised machine learning requires large labeled datasets, a prohibitive limitation in many real-world applications. What if machines could learn with fewer labeled examples? This talk explores and demonstrates an algorithmic solution that relies on collaboration between humans and machines to label smartly, and discusses product possibilities.
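The talk's specific algorithm isn't detailed in this abstract; one common form of human-machine labeling collaboration is uncertainty sampling, where the examples a model is least sure about are routed to human annotators first. A minimal sketch (function name and scores are hypothetical):

```python
def uncertainty_sampling(probabilities, budget):
    """Rank unlabeled examples by binary-classifier uncertainty
    (predicted probability closest to 0.5) and return the indices
    to route to a human annotator, most uncertain first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]

# Hypothetical model scores for five unlabeled examples.
probs = [0.97, 0.51, 0.08, 0.45, 0.88]
print(uncertainty_sampling(probs, budget=2))  # [1, 3]
```

In a full active-learning loop, the newly labeled examples are folded back into training and the scores are refreshed before the next round of selection.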

Mark Samson is a principal systems engineer at Cloudera, helping customers solve their big data problems using modern data platforms based on Hadoop. Mark has 20 years’ experience working with big data and information management software in technical sales, service delivery, and support roles.

Presentations

Information Architecture for a Modern Data Platform Session

It is now possible to build a modern data platform capable of storing, processing, and analysing a wide variety of data across multiple public and private cloud platforms and on-premise data centres. This session outlines an information architecture for such a platform, informed by work with multiple large organisations that have built such platforms over the last 5 years.

Danilo Sato is a polyglot principal consultant with more than fifteen years of experience as an architect, data engineer, developer, and agile coach. Balancing strategy with execution, Danilo helps clients refine their technology strategy while adopting practices to reduce the time between having an idea, implementing it, and running it in production using cloud, DevOps and continuous delivery. Danilo authored DevOps In Practice: Reliable and Automated Software Delivery, is a member of ThoughtWorks’ Office of the CTO, and an experienced international conference speaker.

Presentations

Continuous Intelligence: Moving Machine Learning into Production Reliably Tutorial

In this workshop, we will present how to apply the concept of Continuous Delivery (CD) - which ThoughtWorks pioneered - to data science and machine learning. It allows data scientists to make changes to their models, while at the same time safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with a high frequency.

Volker Schnecke has almost 20 years' experience working in research and development in the pharmaceutical industry. His current role is in late-stage clinical development at Novo Nordisk in Denmark, where his focus is on exploiting observational data to support the obesity pipeline. His tasks cover the whole drug discovery and development value chain, from collaborating with preclinical researchers to producing evidence for the marketing of new medicines.

Presentations

Using Electronic Health Records to Predict Health Risks Associated with Obesity DCS

Today more than 650 million people worldwide are obese, and most of them will develop additional health issues during their lifetime. However, not all are at equal risk. In this session, we will show how we mine the electronic health records (EHRs) of millions of patients to understand risk in people with obesity and to support the discovery of new medicines.

Max Schultze is a data engineer building a data lake at Europe's biggest online fashion retailer, Zalando. His focus lies on building data pipelines at a scale of terabytes per day and productionizing Spark and Presto as analytical platforms inside the company. He graduated from the Humboldt University of Berlin, where he took an active part in the university's initial development of Apache Flink.

Presentations

From legacy to cloud: an end to end data integration journey Session

This session covers a data lake implementation at a large-scale company: raw data collection, standardized data preparation (e.g., binary conversion, partitioning), user-driven analytics, and machine learning.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Foundations for Successful Data Projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation, we'll provide guidance and best practices, from planning to implementation, based on years of experience working with companies to deliver successful data projects.

Tuhin Sharma is cofounder of Binaize Labs, an AI-based firm. He worked at IBM Watson and Red Hat as a data scientist, focusing mainly on social media analytics, demand forecasting, retail analytics, and customer analytics. He has also worked at multiple startups, where he built personalized recommendation systems to maximize customer engagement with the help of ML and DL techniques across domains like fintech, edtech, media, and ecommerce. He completed his postgraduate degree in computer science and engineering, specializing in data mining, at the Indian Institute of Technology Roorkee. He has filed 5 patents and published 4 research papers in the field of natural language processing and machine learning. In his leisure time, he loves to play table tennis and guitar. His favourite quote is "Life is Beautiful." You can tweet him at @iamtuhinsharma.

Presentations

Powering Account Based Marketing for Digital Marketers using Deep Learning Session

The spray and pray model for B2B Digital Marketing is long gone. Companies are now employing very specific and tailored messaging for each key account - as part of their Account-Based Marketing strategy. In this talk, the speakers show how they built deep learning models to identify and nurture accounts throughout the account's customer journey.

Ben is a software engineer at Google working on the Dataproc team improving the experience of autoscaling with Spark.

Presentations

Improving Spark Down Scaling: Or not throwing away all of our work Session

As more workloads move to "serverless"-like environments, the importance of properly handling downscaling increases.

Rosaria Silipo is a principal data scientist at KNIME. Rosaria holds a doctorate in bioengineering and has spent most of her professional life working on data science projects for customer companies in a number of different fields, such as IoT, customer intelligence, finance, and cybersecurity.

Presentations

Practicing Data Science: A Collection of Case Studies Session

This is a collection of past data science projects. While the structure is often similar (data collection, data transformation, model training, deployment), each of them needed some special trick. The turning point in implementing the data science solution was either a change in perspective or a particular technique for dealing with a special case or a special business question.

Dr. Alkis Simitsis is Chief Scientist for Cyber Security Analytics at Micro Focus. He has more than 15 years of experience in multiple roles building innovative information and data management solutions in areas like real-time business intelligence, security, massively parallel processing, systems optimization, data warehousing, graph processing, and web services. Alkis holds 26 U.S. patents and has filed over 50 patent applications in the U.S. and worldwide, has published more than 100 papers in refereed international journals and conferences (top publications cited 5,000+ times), and frequently serves in various roles on the program committees of top-tier international scientific conferences. He is an IEEE senior member and a member of the ACM.

Presentations

A Magic 8-Ball for Optimal Cost and Resource Allocation for the Big Data Stack Session

Cost and resource provisioning are critical components of the big data stack. A magic 8-ball for the big data stack would give an enterprise a glimpse into its future needs and would enable effective and cost-efficient project and operational planning. This talk covers how to build that magic 8-ball, a decomposable time-series model, for optimal cost and resource allocation for the big data stack.
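The abstract doesn't give the model's internals, but a decomposable time-series model in its simplest additive form (trend plus seasonal component) can be sketched as follows, assuming a known seasonal period; the function and toy data are illustrative:

```python
def decompose(series, period):
    """Additive decomposition: trend via a centered moving average,
    seasonal via per-phase means of the detrended values."""
    n, half = len(series), period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(window) / len(window)
    # Pair each detrended value with its phase within the period.
    pairs = [(series[i] - trend[i], i % period)
             for i in range(n) if trend[i] is not None]
    seasonal = []
    for p in range(period):
        vals = [d for d, ph in pairs if ph == p]
        seasonal.append(sum(vals) / len(vals) if vals else 0.0)
    return trend, seasonal

# Toy series: linear trend plus a repeating [0, 5, -5] seasonal pattern.
series = [i + [0, 5, -5][i % 3] for i in range(9)]
trend, seasonal = decompose(series, period=3)
print(seasonal)  # [0.0, 5.0, -5.0]
```

Once trend and seasonality are separated, each component can be extrapolated independently, which is what makes decomposable models convenient for capacity forecasting.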

Rebecca Simmonds is a senior software engineer at Red Hat, where she is part of an emerging technology group comprising both data scientists and developers. She completed a PhD at Newcastle University, in which she developed a platform for scalable geospatial and temporal analysis of Twitter data. She then moved to a small startup as a Java developer, creating solutions to improve performance for a CV analyser. She has a keen interest in architecture design and data analysis, which she is furthering at Red Hat with OpenShift and ML research.

Presentations

Application intelligence: bridging the gap between human expertise and machine learning Session

Artificial intelligence and machine learning are now popular terms, but how do we make use of these techniques without throwing away the valuable knowledge of experienced employees? This session delves into this idea with examples of how distributed machine learning frameworks fit together naturally with business rules management systems.

Animesh Singh is an STSM and lead for IBM Watson and Cloud Platform, where he leads machine learning and deep learning initiatives on IBM Cloud and works with communities and customers to design and implement deep learning, machine learning, and cloud computing frameworks. He has a proven track record of driving the design and implementation of private and public cloud solutions from concept to production. In his decade-plus at IBM, Animesh has worked on cutting-edge projects for IBM enterprise customers in the telco, banking, and healthcare industries, particularly focusing on cloud and virtualization technologies, and led the design and development of the first IBM public cloud offering.

Presentations

Building a Secure and Transparent ML Pipeline Using Open Source Technologies Session

The application of AI algorithms in domains such as criminal justice, credit scoring, and hiring holds unlimited promise. At the same time, it raises legitimate concerns about algorithmic fairness. There is a growing demand for fairness, accountability, and transparency from machine learning (ML) systems. In this talk we cover how to build just such a pipeline leveraging open source tools.

Pete Skomoroch is Head of Data Products at Workday. He was Co-Founder and CEO of SkipFlag, a venture-backed deep learning startup which was acquired by Workday in 2018. Pete is a senior executive with extensive experience building and running teams that develop products powered by data and machine learning. Previously, he was an early member of the data team at LinkedIn, the world’s largest professional network with over 500 million members worldwide. As a Principal Data Scientist at LinkedIn, he led data science teams focused on reputation, search, inferred identity, and building data products. He was also the creator of LinkedIn Skills and LinkedIn Endorsements, one of the fastest growing new product features in LinkedIn’s history.

Presentations

Executive Briefing: Why Managing Machines is Harder Than You Think Session

Companies that understand how to apply machine intelligence will scale and win their respective markets over the next decade. Others will fail to ship successful AI products that matter to customers. This talk describes how to combine product design, machine learning, and executive strategy to create a business where every product interaction benefits from your investment in machine intelligence.

Guoqiong Song is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

LSTM-Based Time Series Anomaly Detection Using Analytics Zoo for Spark and BigDL Session

Collecting and processing massive time series data (e.g., logs, sensor readings) and detecting anomalies in real time is critical for many emerging smart systems in areas such as industrial manufacturing, AIOps, and IoT. This talk will share how to detect anomalies in time series data at scale using Analytics Zoo and BigDL on a standard Spark cluster.
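Analytics Zoo and BigDL supply the LSTM forecaster itself; downstream of any forecaster, the anomaly-flagging step often reduces to thresholding forecast errors. A minimal sketch (the threshold choice and data are illustrative, not the talk's implementation):

```python
def flag_anomalies(actual, predicted, n_std=3.0):
    """Flag time steps whose absolute forecast error exceeds the
    mean error by more than n_std standard deviations."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mean = sum(errors) / len(errors)
    std = (sum((e - mean) ** 2 for e in errors) / len(errors)) ** 0.5
    return [i for i, e in enumerate(errors) if e > mean + n_std * std]

# Hypothetical forecasts: the spike at step 4 stands out.
actual    = [1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 1.1]
predicted = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(flag_anomalies(actual, predicted, n_std=2.0))  # [4]
```

In streaming settings, the error mean and standard deviation are usually maintained over a rolling window so the threshold adapts as the signal drifts.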

Raghotham Sripadraj is Principal Data Scientist at Treebo Hotels. Previously, he was cofounder and data scientist at Unnati Data Labs, where he was building end-to-end data science systems in the fields of fintech, marketing analytics, and event management. Raghotham is also a mentor for data science on Springboard. Previously, at Touchpoints Inc., he single-handedly built a data analytics platform for a fitness wearable company; at SAP Labs, he was a core part of what is currently SAP’s framework for building web and mobile products, as well as a part of multiple company-wide events helping to spread knowledge both internally and to customers.

Drawing on his deep love for data science and neural networks and his passion for teaching, Raghotham has conducted workshops across the world and given talks at a number of data science conferences. Apart from getting his hands dirty with data, he loves traveling, Pink Floyd, and masala dosas.

Presentations

Deep Learning for Fonts Session

Deep learning has enabled massive breakthroughs in offbeat tracks, enabling a better understanding of how an artist paints, how an artist composes music, and so on. As part of Nischal and Raghotham's beloved project, Deep Learning for Humans, they want to build a font classifier and show the masses how fonts can be classified, and how and why two or more fonts are similar.

Scott is a Senior Data Scientist at Faculty, where he develops and deploys state-of-the-art machine learning models. He leads Faculty’s research into the use of deep learning for realistic speech synthesis, and architected core components of the Faculty data science platform. Outside of Faculty, he maintains and contributes to a range of open source software. Scott holds a DPhil in Particle Physics from Oxford University, and before joining Faculty carried out fundamental physics research at CERN and Stanford University.

Presentations

Deep learning for speech synthesis: the good news, the bad news, and the fake news Session

Modern deep learning systems allow us to build speech synthesis systems with the naturalness of a human speaker. Whilst there are myriad benevolent applications, this also ushers in a new era of fake news. This talk will explore the danger of such systems, as well as how deep learning can also be used to build countermeasures to protect against political disinformation.

Bargava Subramanian is a Deep Learning engineer and co-founder of a boutique AI firm, Binaize Labs, in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies. He mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

Powering Account Based Marketing for Digital Marketers using Deep Learning Session

The spray and pray model for B2B Digital Marketing is long gone. Companies are now employing very specific and tailored messaging for each key account - as part of their Account-Based Marketing strategy. In this talk, the speakers show how they built deep learning models to identify and nurture accounts throughout the account's customer journey.

Ravi is lead data engineer at GoJek, where he builds resilient and scalable data infrastructure across all of GO-JEK's 18+ products, which help millions of Indonesians commute, shop, eat, and pay daily.

Presentations

Data Infrastructure at GoJek Session

At GO-JEK, we build products that help millions of Indonesians commute, shop, eat, and pay daily. The data team is responsible for creating resilient and scalable data infrastructure across all of GO-JEK's 18+ products. This involves building distributed big data infrastructure and real-time analytics and visualization pipelines for billions of data points per day.

Václav Surovec was born in 1988 and lives in Prague. He has worked at T-Mobile CZ since 2014, currently as a senior big data engineer. He co-manages the Big Data department of more than 45 people and co-leads several projects focused on Hadoop and big data.

Presentations

Data Science in Deutsche Telekom - Predicting global travel patterns and network demand Session

Knowledge of the location and travel patterns of customers is important for many companies, one of which is the German telco operator Deutsche Telekom. The Commercial Roaming project, built on Cloudera Hadoop, helped the company better analyze the behavior of its customers from 10 countries, in a very secure way, and provide better predictions and visualizations for management.

Anna is an engineering manager at Cloudera where she established and manages the Data Interoperability team. As a software engineer at Cloudera she has worked on Apache Sqoop. Anna cares about enabling people to build high quality software in a sustainable environment. Before her time at Cloudera she worked on Risk Management systems at Morgan Stanley.

Presentations

Picking Parquet: Improved Performance for Selective Queries in Impala, Hive, and Spark Session

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. We will cover the technical details of the design and its implementation, and we will give practical tips to help data architects leverage these new capabilities in their schema design. Finally, we will show performance results for common workloads.
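As an illustration of the idea (a toy sketch, not the Parquet implementation itself), per-page min/max statistics let an engine skip pages that cannot match a range predicate without ever reading them:

```python
def prune_pages(pages, lo, hi):
    """Return indices of pages whose [min, max] statistics overlap
    the predicate range [lo, hi]; the rest can be skipped without
    being read, which is the core idea behind column indexes."""
    keep = []
    for i, page in enumerate(pages):
        pmin, pmax = min(page), max(page)
        if pmax >= lo and pmin <= hi:
            keep.append(i)
    return keep

# Three sorted "pages" of a column; a range predicate touches only one.
pages = [[1, 3, 5], [7, 9, 11], [13, 15, 17]]
print(prune_pages(pages, lo=8, hi=12))  # [1]
```

The payoff is largest on selective queries over sorted or clustered columns, where most pages fall entirely outside the predicate range.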

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial for scalable NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Spark NLP in Action: How Indeed Applies NLP to Standardize Resume Content at Scale Session

In this talk you will learn how to use Spark NLP and Apache Spark to standardize semi-structured text. You will see how Indeed standardizes resume content at scale.

Presentations

Keynote with Michael Tidmarsh Keynote

Michael Tidmarsh

Deepak Tiwari is the head of product management for data at Lyft, where he is responsible for the company’s data vision as well as for building its data infrastructure, data platform, and data products. This includes Lyft’s streaming infrastructure for real-time decision making, geodata store and visualization, platform for machine learning, and core infrastructure for big data analytics. Previously, he was a product management leader at Google, where he worked on search, cloud, and technical infrastructure products. Deepak is passionate about building products that are driven by data, focus on user experience, and work at web scale. He holds an MBA from Northwestern’s Kellogg School of Management and a BT in engineering from the Indian Institute of Technology, Kharagpur.

Presentations

Lyft Data Platform, Now and in the future Session

Lyft's data platform is at the heart of Lyft's business. Decisions from pricing to ETA to business operations rely on it, and it powers the enormous scale and speed at which Lyft operates. In this talk, Mark Grover walks through various choices Lyft has made in the development and sustenance of the data platform, and why, along with what lies ahead.

Teresa Tung is a managing director at Accenture Technology Labs, where she is responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s Applied Intelligence Platform. She is Accenture’s most prolific inventor with 170+ patents and applications. Teresa holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Using a Domain Knowledge Graph to Manage AI at Scale Session

How do enterprises scale AI, moving beyond one-off projects to make it reusable? Teresa Tung and Jean-Luc Chatelain explain how domain knowledge graphs—the same technology behind today's Internet search—can bring the same democratized experience to enterprise AI. Beyond search applications, they show applications of knowledge graphs in oil and gas, financial services, and enterprise IT.

Sandeep Uttamchandani is the hands-on chief data architect at Intuit. He is currently leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Prior to Intuit, Sandeep played various engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep's experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production for IBM's federal and Fortune 100 customers. He has received several excellence awards and holds over 40 issued patents, with 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions, guest-lectures for university courses, and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and is a past associate editor for ACM Transactions on Storage. He blogs on LinkedIn and Wrong Data Fabric (his personal blog). Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.

Presentations

Half correct and Half wrong tribal data knowledge: Our 3 patterns to sanity! Session

Teams today rely on tribal data dictionaries that are a mixed bag with respect to correctness: some datasets have accurate attribute details, while others are incorrect and outdated. This significantly impacts the productivity of analysts and data scientists. Existing data dictionary tools are manually updated and difficult to maintain. This talk covers three patterns we have deployed to manage data dictionaries.

Vinithra Varadharajan is a senior engineering manager in the cloud organization at Cloudera, where she is responsible for the cloud portfolio products, including Altus Data Engineering, Altus Analytic Database, Altus SDX, and Cloudera Director. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges from re-architecting to be cloud-native, to data context consistency across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover in depth cloud architecture and challenges; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Nanda Vijaydev is director of solution management at BlueData, where she leverages technologies like Hadoop, Spark, Python, and TensorFlow to build solutions for enterprise analytics and machine learning use cases. Nanda has 10 years of experience in data management and data science. Previously, she worked on data science and big data projects in multiple industries, including healthcare and media; was a principal solutions architect at Silicon Valley Data Science; and served as director of solutions engineering at Karmasphere. Nanda has an in-depth understanding of the data analytics and data management space, particularly in the areas of data integration, ETL, warehousing, reporting, and machine learning.

Presentations

Deep learning with TensorFlow and Spark using GPUs and Docker containers Session

Organizations need to keep ahead of their competition by using the latest AI/ML/DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. This session will discuss the effective deployment of such applications in a container environment.

Lars is a software engineer at Cloudera. He has worked on various parts of Apache Impala, including crash handling, its Parquet scanners, and scan range scheduling. Most recently, he worked on integrating Kudu's RPC framework into Impala. Before his time at Cloudera, he worked on various databases at SAP.

Presentations

Picking Parquet: Improved Performance for Selective Queries in Impala, Hive, and Spark Session

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. We will cover the technical details of the design and its implementation, and we will give practical tips to help data architects leverage these new capabilities in their schema design. Finally, we will show performance results for common workloads.
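The mechanism behind these column indexes can be illustrated in a few lines: the writer records min/max statistics for each page, and a selective reader consults those statistics to skip pages whose value range cannot match the predicate. The following pure-Python simulation is only a conceptual sketch (the function names, page size, and data are invented for illustration); the real Parquet column-index format is considerably more involved.

```python
# Illustrative sketch of Parquet-style page skipping via per-page
# min/max statistics. A simulation of the concept only, not the
# actual Parquet column-index implementation.

def build_page_index(values, page_size):
    """Split a column into pages and record min/max stats per page."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    stats = [(min(p), max(p)) for p in pages]
    return pages, stats

def selective_scan(pages, stats, low, high):
    """Read only pages whose [min, max] range overlaps the predicate."""
    matches, pages_read = [], 0
    for page, (lo, hi) in zip(pages, stats):
        if hi < low or lo > high:
            continue  # the whole page is skipped without being read
        pages_read += 1
        matches.extend(v for v in page if low <= v <= high)
    return matches, pages_read

# A column sorted on write clusters values, so few pages overlap the predicate.
column = sorted(range(10_000))
pages, stats = build_page_index(column, page_size=1_000)
rows, pages_read = selective_scan(pages, stats, 4_200, 4_300)
print(pages_read, len(rows))  # only 1 of 10 pages is read; 101 rows match
```

This is also why the schema-design tips matter: sorting or clustering data on the predicate column on write is what makes the min/max ranges narrow enough to skip most pages.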

Scaling Impala - Common Mistakes and Best Practices Session

Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala can handle hundreds of nodes and tens of thousands of queries hourly. In this talk, we will discuss how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and antipatterns for end users and BI applications.

Dr. Sandra Wachter is a lawyer and Research Fellow (Asst. Prof.) in Data Ethics, AI, robotics and Internet Regulation/cyber-security at the Oxford Internet Institute at the University of Oxford where she also teaches the course Internet Technologies and Regulation. Sandra is also a Fellow at the Alan Turing Institute in London, a Fellow of the World Economic Forum’s Global Futures Council on Values, Ethics and Innovation, an Academic Affiliate at the Bonavero Institute of Human Rights at Oxford’s Law Faculty and a member of the Law Committee of the IEEE. Prior to joining the OII, Sandra studied at the University of Oxford and the Law Faculty at the University of Vienna and worked at the Royal Academy of Engineering and at the Austrian Ministry of Health.

Sandra serves as a policy advisor for governments, companies, and NGOs around the world on regulatory and ethical questions concerning emerging technologies. Her work has been featured in (among others) The Telegraph, Financial Times, The Sunday Times, The Economist, Science, BBC, The Guardian, Le Monde, New Scientist, Die Zeit, Der Spiegel, Sueddeutsche Zeitung, Engadget, and WIRED. In 2018 she won the ‘O2RB Excellence in Impact Award’, and in 2017 the CognitionX ‘AI superhero Award’, for her contributions to AI governance.

Sandra specialises in technology, IP, and data protection law, as well as European, international, human rights, and medical law. Her current research focuses on the legal and ethical implications of Big Data, AI, and robotics, as well as governmental surveillance, predictive policing, and human rights online. She is also working on the ethical design of algorithms, including the development of standards and (auditing) methods to ensure fairness, accountability, transparency, interpretability, and group privacy in complex algorithmic systems.

Sandra is also interested in legal and ethical aspects of robotics (e.g. surgical, domestic and social robots) and autonomous systems (e.g. autonomous and connected cars), including liability, accountability, and privacy issues as well as international policies and regulatory responses to the social and ethical consequences of automation.

Internet policy and regulation as well as cyber-security issues are also at the heart of her research, where she addresses areas such as online surveillance and profiling, censorship, intellectual property law, and human rights and identity online. Areas such as mass surveillance methods such as the European Data Retention Directive and its compatibility with the jurisprudence of the European Court of Human Rights as well as tensions between freedom of speech and the right to privacy on social networks are of particular interest. Previous work also looked at (bio) medical law and bio ethics in areas such as interventions in the genome and genetic testing under the Convention on Human Rights and Biomedicine.

Presentations

Privacy, identity, and autonomy in the age of Big Data and AI Keynote

Dr. Sandra Wachter is a lawyer and Research Fellow (Asst. Prof.) in Data Ethics, AI, robotics and Internet Regulation/cyber-security at the Oxford Internet Institute.

Kai Waehner is a technology evangelist at Confluent. Kai’s areas of expertise include big data analytics, machine learning, deep learning, messaging, integration, microservices, the internet of things, stream processing, and the blockchain. He is a regular speaker at international conferences such as JavaOne, O’Reilly Software Architecture, and ApacheCon and has written a number of articles for professional journals. Kai also shares his experiences with new technologies on his blog.

Presentations

Unleashing Apache Kafka and TensorFlow in Hybrid Architectures Session

How can you leverage the flexibility and extreme scale of the public cloud, combined with the Apache Kafka ecosystem, to build scalable, mission-critical machine learning infrastructures that span multiple public clouds or bridge your on-premises data centre to the cloud? Join this talk to learn how to apply technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures.

Chris Wallace is a data scientist at Cloudera Fast Forward Labs. He works on making breakthroughs in machine intelligence accessible and applicable in the “real world”. He has previous experience doing data science in organisations both large (the UK NHS) and small (first employee at a tech startup). Chris likes building data products and cares deeply about making technology work for people, not vice versa. He holds a PhD in particle physics from the University of Durham.

Presentations

Federated learning: machine learning with privacy on the edge Session

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. In this talk we’ll cover the algorithmic solutions and the product opportunities.
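The core algorithmic idea in federated learning, federated averaging, can be sketched briefly: each device fits a model on its own private data, and only the model parameters (never the data) travel to a central server, which combines them. This is a minimal illustration using a hypothetical one-dimensional linear model y = w·x with a closed-form local fit; real federated systems iterate this over many rounds with far richer models.

```python
# Minimal sketch of federated averaging (FedAvg): each device trains
# locally on data that never leaves it; only the learned weight is
# sent to the server, which takes a size-weighted average.

def local_fit(xs, ys):
    """Closed-form least-squares slope for y = w * x on one device."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def federated_average(device_datasets):
    """Average locally trained weights, weighted by local dataset size."""
    total = sum(len(xs) for xs, _ in device_datasets)
    return sum(local_fit(xs, ys) * len(xs) for xs, ys in device_datasets) / total

# Three devices, each holding private data drawn from y = 2x (never pooled).
devices = [
    ([1.0, 2.0], [2.0, 4.0]),
    ([3.0, 4.0, 5.0], [6.0, 8.0, 10.0]),
    ([6.0], [12.0]),
]
w = federated_average(devices)
print(round(w, 6))  # each local fit recovers w = 2, so the average does too
```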

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler, Ph.D. is the VP of Fast Data Engineering at Lightbend, where he leads the development of the Lightbend Fast Data Platform, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Executive Briefing: What it takes to use machine learning in fast data pipelines Session

Your team is building machine learning capabilities. I'll discuss how you can integrate these capabilities into streaming data pipelines so you can leverage the results quickly and update them as needed. There are big challenges: how do you build long-running services that are highly reliable and scalable? How do you combine a spectrum of very different tools, from data science to operations?

Hands-on Machine Learning with Kafka-based Streaming Pipelines Tutorial

This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.
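The pattern at the heart of the tutorial, per-event low-latency scoring combined with periodic model refresh, can be sketched without any of the real infrastructure. In the toy version below the Kafka stream, the model, and the retraining step are all simulated stand-ins (invented for illustration); a real pipeline would consume from Kafka and load serialized models instead.

```python
# Sketch of the pattern only: score every event in a consumer loop
# and swap in a refreshed model every N events. The "stream" and
# "model" here are simulated stand-ins, not Kafka or a real ML model.

def make_model(threshold):
    """A stand-in model: flags events whose value exceeds a threshold."""
    return lambda event: event > threshold

def run_pipeline(events, retrain_every):
    model = make_model(threshold=10)       # initial model
    flagged, refreshes = [], 0
    for i, event in enumerate(events, start=1):
        if model(event):                   # low-latency scoring per event
            flagged.append(event)
        if i % retrain_every == 0:         # periodic model refresh
            model = make_model(threshold=10 + refreshes)
            refreshes += 1
    return flagged, refreshes

flagged, refreshes = run_pipeline(events=range(20), retrain_every=5)
print(len(flagged), refreshes)
```

The key design point the toy preserves: scoring stays on the hot path, while retraining happens out of band and is swapped in atomically between events.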

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges, from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-premises and in the cloud. First, we'll cover cloud architecture and its challenges in depth; then you'll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Moshe Wasserblat is currently the Natural Language Processing and Deep Learning Research Group Manager for Intel’s Artificial Intelligence Products Group. Previously, he was with NICE Systems for more than 17 years, where he founded and led the speech/text analytics research team. His interests are in speech processing and natural language processing. He was the co-founder and coordinator of the EXCITEMENT FP7 ICT program and has served as organizer and manager of several initiatives, including many Israeli Chief Scientist programs. He has filed more than 60 patents in the field of language technology and has several publications in international conferences and journals. His areas of expertise include speech recognition, conversational natural language processing, emotion detection, speaker separation, speaker recognition, deep learning, and machine learning.

Presentations

NLP Architect by Intel's AI-Lab Session

Moshe Wasserblat presents an overview of NLP Architect, an open source deep learning NLP library that provides state-of-the-art NLP models, making it easy for researchers to implement NLP algorithms and for data scientists to build NLP-based solutions that extract insights from textual data to improve business operations.

Dr Sophie Watson is a software engineer in an Emerging Technology Group at Red Hat, where she applies her data science and statistics skills to solving business problems and informing next-generation infrastructure for intelligent application development. She has a background in mathematics and holds a PhD in Bayesian statistics, in which she developed algorithms to estimate intractable quantities quickly and accurately.

Presentations

Learning "Learning to Rank" Session

Identifying relevant documents quickly and efficiently enhances both user experience and business revenue every day. Sophie Watson demonstrates how to implement Learning to Rank algorithms and provides you with the information you need to implement your own successful ranking system.

Thomas is a software engineer on the streaming platform team at Lyft, working with Apache Flink. He is also a PMC member of Apache Apex and Apache Beam and has contributed to several more ecosystem projects. Thomas is a frequent speaker at international big data conferences and the author of the book Learning Apache Apex.

Presentations

Streaming at Lyft Session

Fast data and stream processing are essential to making Lyft rides a good experience for passengers and drivers. Our systems need to track and react to event streams in real time to update locations, compute routes and estimates, balance prices, and more. The streaming platform at Lyft powers these use cases with a development framework and deployment stack based on Apache Flink and Beam.

Charlotte Werger works at the intersection of artificial intelligence and finance. After completing her PhD at the European University Institute in Florence, she worked in quantitative hedge funds at BlackRock and Man AHL in London as a portfolio manager and quant researcher. There she was part of an early movement in asset management that initiated the application of machine learning models to predicting financial markets. Having developed a broader interest in AI and machine learning, she then worked for ASI Data Science, a cutting-edge AI startup that helps its clients by building AI applications and software. Currently Charlotte works in Amsterdam as Lead Data Scientist at Van Lanschot Kempen, a wealth manager and private bank, where she is challenged to transform this traditional company into a cutting-edge data-driven one. Outside of work she is internationally active in data science and AI education and advisory: she is an instructor for DataCamp, mentors data science students on the Springboard platform, and has an advisory role at Ryelore AI.

Presentations

Data science transformation: transforming a traditional wealth manager into a cutting-edge data-driven company Findata

In this talk, we outline the components necessary to transform a traditional wealth manager into a data-driven business. Special attention is paid to devising and executing a transformation strategy by identifying key business subunits where automation and improved predictive modelling can result in significant gains and synergies.

Fraud Detection at a Financial Institution Using Unsupervised Learning and Text Mining Session

This talk discusses a best-practice use case for detecting fraud at a financial institution. Where traditional systems fall short, machine learning models can provide a solution. Sifting through large amounts of transaction data, external hit lists, and unstructured text data, we managed to build a dynamic and robust monitoring system that successfully detects unwanted client behavior.

Elliot is a principal engineer at Hotels.com in London, where he designs tooling and platforms in the big data space. Prior to this, Elliot worked on Last.fm’s data team, developing services for managing large volumes of music metadata.

Presentations

Herding Elephants: Seamless data access in multi-cluster clouds Session

Expedia Group is a travel platform with an extensive portfolio including Expedia.com and Hotels.com. We like to give our data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. We'll explain how we built a unified virtual data lake on top of our many heterogeneous and distributed data platforms.

Mutant Tests Too: The SQL Session

Hotels.com describes approaches for applying software engineering best practices to SQL-based data applications in order to improve maintainability and data quality. Using open source tools, we show how to build effective test suites for Apache Hive code bases. We also present Mutant Swarm, a mutation testing tool we’ve developed to identify weaknesses in tests and to measure SQL code coverage.
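The idea mutation testing rests on is simple enough to show in miniature: deliberately introduce a small defect (a mutant) into the logic under test, and check whether the test suite fails. A mutant that survives reveals a gap in the tests. Mutant Swarm applies this to SQL; the sketch below (all names invented) uses a plain Python predicate as a stand-in for a query filter.

```python
# Conceptual sketch of mutation testing: mutate a predicate and see
# whether the test suite notices. A surviving mutant means the tests
# have a blind spot, here, the boundary value.

def original(x):
    return x > 10       # the logic under test

def mutant(x):
    return x >= 10      # a typical mutation: > becomes >=

def weak_suite(fn):
    return fn(20) and not fn(5)           # never probes the boundary

def strong_suite(fn):
    return weak_suite(fn) and not fn(10)  # also checks the boundary value

# A suite "kills" the mutant if it passes on the original but fails on it.
weak_kills = weak_suite(original) and not weak_suite(mutant)
strong_kills = strong_suite(original) and not strong_suite(mutant)
print(weak_kills, strong_kills)  # False True
```

The weak suite passes for both versions, so the mutant survives; the strong suite kills it, which is exactly the signal a mutation-testing tool reports back.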

Dr. Arif Wider is a lead consultant and developer at ThoughtWorks Germany, where he enjoys building scalable applications, teaches Scala, and consults at the intersection of data science and software engineering. Before joining ThoughtWorks, he worked in research with a focus on data synchronisation, bidirectional transformations, and domain-specific languages.

Presentations

Continuous Intelligence: Keeping your AI Application in Production Session

Machine learning can be challenging to deploy and maintain. Data change, and both models and the systems that implement them must be able to adapt. Any delay moving models from research to production means leaving your data scientists' best work on the table. In this talk, we explore continuous delivery (CD) for AI/ML and examine case studies applying CD principles to data science workflows.

Alicia is an advocate for Google Cloud. Previously, she spent six years as a program manager, and through building, managing, and measuring programs and processes, she fell in love with data science. Known to hang out in spreadsheets surrounded by formulas, she also uses machine learning, SQL, and visualizations to help solve problems and tell stories.

Presentations

Building custom machine learning models for production, without ML expertise DCS

In this talk, Alicia Williams will share how two media companies used custom machine learning models to organize content and make it accessible around the world. Along the way, we will discuss the business problems they solved with ML, demonstrate the ease of use of the tools themselves, and show the value that ML has brought in each case.

Christoph Windheuser studied computer science in Bonn (Germany), Pittsburgh (USA), and Paris (France) and earned his PhD in speech recognition with artificial neural networks. After his scientific career, he held various positions in the IT industry (including SAP and Capgemini Consulting). Today, Christoph is the Global Head of Intelligent Empowerment at ThoughtWorks Inc., responsible for ThoughtWorks’ positioning on data management, machine learning, and artificial intelligence.

Presentations

Continuous Intelligence: Moving Machine Learning into Production Reliably Tutorial

In this workshop, we will present how to apply the concept of continuous delivery (CD), which ThoughtWorks pioneered, to data science and machine learning. It allows data scientists to make changes to their models while safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with high frequency.

Mingxi Wu is the vice president of engineering at TigerGraph, a Silicon Valley-based startup building a world-leading real-time graph database. Over his career, Mingxi has focused on database research and data management software. Previously, he worked in Microsoft’s SQL Server group, Oracle’s Relational Database Optimizer group, and Turn Inc.’s Big Data Management group. Lately, his interest has turned to building an easy-to-use and highly expressive graph query language. He has won research awards from the most prestigious publication venues in database and data mining, including SIGMOD, KDD, and VLDB and has authored five US patents with three more international patents pending. Mingxi holds a PhD from the University of Florida, specializing in both database and data mining.

Presentations

Eight Prerequisites of a Graph Query Language Session

A graph query language is the key to unleashing the value of connected data. In this talk, we point out eight prerequisites of a practical graph query language, drawn from our six years of experience with real-world graph analytics use cases, and compare GSQL, Gremlin, Cypher, and SPARQL in this regard.

Tony Wu manages the Altus core engineering team at Cloudera. Previously, Tony was a team lead on the partner engineering team at Cloudera, where he was responsible for Microsoft Azure integration for Cloudera Director.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges, from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-premises and in the cloud. First, we'll cover cloud architecture and its challenges in depth; then you'll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Chendi Xue is a software engineer on Intel’s SSG data analytics team. She has five years’ experience in Linux cloud storage system development, optimization, and benchmarking, including Ceph benchmarking and tuning, Spark performance tuning and optimization on disaggregated storage, and HDCS development.

Presentations

Big data analytics on the public cloud: Challenges and opportunities Session

This session introduces the challenges of migrating big data analytics workloads to the public cloud, such as performance loss and missing features, and showcases how a new in-memory data accelerator leveraging persistent memory and RDMA NICs can resolve these issues and enable new opportunities for big data workloads in the cloud.

I am a big data tech lead at the Nielsen Marketing Cloud. I have been tackling big data challenges for the past six years, using tools like Spark, Druid, and Kafka.
I’m keen on sharing my knowledge and have presented my real-life experience in various forums (e.g., meetups and conferences).

Presentations

Stream, Stream, Stream: Different Streaming methods with Spark and Kafka Session

At Nielsen Marketing Cloud, we provide our customers (marketers and publishers) with real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner. In this talk, we will discuss how we continuously transform our data infrastructure to support these goals.

Alexis Yelton is a data scientist at Indeed. She has a Ph.D. in bioinformatics and did postdoctoral work building models to predict gene function and explain ecosystem function. Since then, she has focused on building machine learning models for software products. She has been working with Spark since version 1.6 and has recently moved into the NLP space.

Presentations

Spark NLP in Action: How Indeed Applies NLP to Standardize Resume Content at Scale Session

In this talk, you will learn how to use Spark NLP and Apache Spark to standardize semi-structured text. You will see how Indeed standardizes resume content at scale.

Jian Zhang is a software engineering manager at Intel. He and his team primarily focus on open source storage development and optimization on Intel platforms and build reference solutions for customers. He has 10 years of experience in performance analysis and optimization for many open source projects, such as Xen, KVM, Swift, Ceph, and HDFS, and benchmarking workloads like SPEC and TPC. Jian has a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Presentations

Big data analytics on the public cloud: Challenges and opportunities Session

This session introduces the challenges of migrating big data analytics workloads to the public cloud, such as performance loss and missing features, and showcases how a new in-memory data accelerator leveraging persistent memory and RDMA NICs can resolve these issues and enable new opportunities for big data workloads in the cloud.

Weifeng Zhong is a research fellow in economic policy studies at the American Enterprise Institute, where his research focuses on Chinese economic issues and political economy. His recent work has been on the application of text-analytic and machine-learning techniques to political economy issues such as the US presidential election, income inequality, and predicting policy changes in China. He has been published in a variety of scholarly journals, including the Journal of Institutional and Theoretical Economics. In the popular press, his writings have appeared in the Financial Times, Foreign Affairs, The National Interest, and Real Clear Politics, among others. He has a Ph.D. and an M.Sc. in managerial economics and strategy from Northwestern University. He also holds M.Econ. and M.Phil. degrees in economics from the University of Hong Kong and a B.A. in business administration from Shantou University in China.

Presentations

Reading China: Predicting Policy Change with Machine Learning Session

We developed a machine learning algorithm to “read” the People’s Daily — the official newspaper of the Communist Party of China — and predict changes in China’s policy priorities using only the information in the newspaper. The output of this algorithm, which we call the Policy Change Index (PCI) of China, turns out to be a leading indicator of the actual policy changes in China since 1951.

Yuan Zhou is a senior software development engineer in the Software and Services Group at Intel, working on the Open Source Technology Center team with a primary focus on big data storage software. He has worked on databases, virtualization, and cloud computing for most of his 7+ years at Intel.

Presentations

Big data analytics on the public cloud: Challenges and opportunities Session

This session introduces the challenges of migrating big data analytics workloads to the public cloud, such as performance loss and missing features, and showcases how a new in-memory data accelerator leveraging persistent memory and RDMA NICs can resolve these issues and enable new opportunities for big data workloads in the cloud.

Xiaoyong Zhu is a senior data scientist at Microsoft, where he focuses on distributed machine learning and its applications.

Presentations

Inclusive Design: Deep learning on audio in Azure, identifying sounds in real time Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in the world who are deaf or hard of hearing. We will explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning to audio in Azure.