Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference
Singapore

Speakers

Leading practitioners will share tips, best practices, and real-world expertise in a rich variety of sessions. Please check back to see the latest updates to the agenda.

Mohammed Abdoolcarim is cofounder and head of product at Vahan, a company building a virtual coach for Uber drivers and distributed workers. Previously, he was lead product manager at Siri and a manager at Google, Apple, Misfit (acquired by Fossil), and GoButler. He holds a degree from Stanford’s d.school.

Presentations

Designing AI-based conversational UIs Session

For the first time, messaging apps have surpassed social networks in usage and growth. Mohammed Abdoolcarim shares best practices for designing for AI-based conversational UIs, such as those employed in messaging apps, drawn from work done at Apple, Google, and GoButler.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data, AI, and data science have great potential for accelerating business, but how do you reconcile business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. John Akred explains how to create a modern data strategy that powers data-driven business.

Executive Briefing: The business case for AI, Spark, and friends Session

AI is white-hot at the moment, but where can it really be used? Developers are usually the first to understand why some technologies cause more excitement than others. John Akred relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Organizing for machine learning success Session

Deploying machine learning in business requires far more than just selecting an algorithm. You need the right architecture, tools, and team organization to drive your agenda successfully. John Akred and Mark Hunter share practical advice on the technical and human sides of machine learning, based on experience preparing Sainsbury’s for its ML-enabled future.

Sarang Anajwala is technical product manager for Autodesk’s next-generation data platform, where he focuses on building the self-service platform for data analysis and data products. Sarang has extensive experience in data architecture and data strategy. Previously, he was an architect building robust big data systems, including a next-generation platform for contextual communications and a data platform for the IoT. He has filed three patents for his innovations in the contextual communications and adaptive designs spaces.

Presentations

Enabling data-driven decision making: Challenges of logical and physical scale Session

Sarang Anajwala offers an overview of Autodesk’s centralized data platform, which democratizes analytics across various teams within Autodesk. The platform has gone through multiple iterations to optimize the balance between a complex one-size-fits-all data access layer and multiple fragmented noncohesive data access layers.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Executive Briefing: The five dysfunctions of a data engineering team Session

Early project success is predicated on management making sure a data engineering team is ready and has all of the skills needed. Jesse Anderson outlines five of the most common nontechnology reasons why data engineering teams fail.

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Real-time systems with Spark Streaming and Kafka (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

What executives and managers need to know about architecture and why Session

We are witnessing an explosion of new architectures. Are they emerging because engineers love new things, or are there good business reasons for these changes? Jesse Anderson examines these new architectures and the actual business problems they solve. You may find that your team is far less productive if you don’t move to them.

Presentations

From physical data collection to digital delivery of results: The data journey in developing economies DCS

Alex Chade shares how Dotz used a coalition loyalty program to successfully collect transactional data (down to the SKU level) from its tens of millions of members across a number of segments, including grocery, gas, pharma, apparel, electronics, CPGs, insurance, and credit cards, and did so mostly in the physical world.

Aki Ariga is a field data scientist at Cloudera, where he works on service development with machine learning and natural language processing. His work has included researching spoken dialogue systems, building a large corpus analysis system, and developing services such as recipe recommendations. Aki is a sparklyr contributor. He organizes several tech communities in Japan, including Ruby, machine learning, and Julia.

Presentations

Train, predict, and serve: How to put your machine learning model into production Session

Aki Ariga explains how to put your machine learning model into production, discusses common issues and obstacles you may encounter, and shares best practices and typical architecture patterns of deployment ML models with example designs from the Hadoop and Spark ecosystem using Cloudera Data Science Workbench.

Carme Artigas is the founder and CEO of Synergic Partners, a strategic and technological consulting firm specializing in big data and data science (acquired by Telefónica in 2015). She has more than 20 years of extensive expertise in the telecommunications and IT fields and has held several executive roles in both private companies and governmental institutions. Carme is a member of the Innovation Board of CEOE and the Industry Affiliate Partners at Columbia University’s Data Science Institute. An in-demand speaker on big data, she has given talks at several international forums, including Strata Data Conference, and collaborates as a professor in various master’s programs on new technologies, big data, and innovation. Carme was recently recognized as the only Spanish woman among the 30 most influential women in business by Insight Success. She holds an MS in chemical engineering, an MBA from Ramon Llull University in Barcelona, and an executive degree in venture capital from UC Berkeley’s Haas School of Business.

Presentations

Analytics at the core of IoT ecosystems Smart Cities

Carme Artigas explains why companies need an IoT strategy based on data analytics to create value for business.

Executive briefing: Analytics centers of excellence as a way to accelerate big data adoption by business Session

Carme Artigas explains why an analytics center of excellence (ACoE), whether internal or outsourced, is an effective way to create mechanisms to deploy big data across the entire organization rather than simply serving a particular department or use case.

From smart cities to intelligent societies Keynote

The concept of smart cities has evolved from sensored urban centers to platform ecosystems that combine data with new technologies such as the IoT, the cloud, and AI. Carme Artigas explores the challenges and opportunities of evolving from smart cities to intelligent societies.

Amr Awadallah is the cofounder and CTO at Cloudera. Previously, Amr was an entrepreneur in residence at Accel Partners, served as vice president of product intelligence engineering at Yahoo, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr’s first startup, VivaSmart, was acquired by Yahoo in July 2000. Amr holds bachelor’s and master’s degrees in electrical engineering from Cairo University, Egypt, and a PhD in electrical engineering from Stanford University.

Presentations

The sixth wave: Automation of decisions Keynote

We are witnessing a new revolution in data—the age of decision automation. Amr Awadallah explains the historic importance of this next wave in automation and highlights the foundational capabilities required to enable it: machine learning and analytics optimized for the cloud.

Ricky Barron is founder and principal at InfoStrategy, a data management and analytics strategy consultancy helping medium-sized to large enterprises develop and operationalize insights for their businesses.

Presentations

Executive Briefing: How to structure, recruit, operationalize, and maintain your insights organization Session

To many organizations, big data analytics is still a solution looking for a problem. Ricky Barron shares practical methods for getting the best out of your big data analytics capability and explains why establishing an "insights group" can improve the bottom line, drive performance, optimize processes, and create new data-driven products and solutions.

Zhaojuan Bianny Bian is an engineering manager in Intel’s Software and Service Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of experience in the industry with performance analysis experience that spans big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.

Presentations

Best practices with Kudu: An end-to-end use case from the automobile industry Session

Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.

Joshua Bloom is vice president of data and analytics at GE Digital, where he serves as the technology and research lead bringing machine learning applications to market within the GE ecosystem. Previously, Joshua was cofounder and CTO of Wise.io (acquired by GE Digital in 2016). Since 2005, he has also been an astronomy professor at the University of California, Berkeley, where he teaches astrophysics and Python for data science. Josh has been awarded the Moore Foundation Data-Driven Investigator Prize and the Pierce Prize from the American Astronomical Society; he is also a former Sloan fellow, a junior fellow of the Harvard Society, and a Hertz Foundation fellow. Joshua holds a PhD from Caltech and degrees from Harvard and Cambridge University.

Presentations

Industrial machine learning Keynote

The ongoing digitization of the industrial-scale machines that power and enable human activity is itself a major global transformation. Joshua Bloom explains why the real revolution—in efficiencies and in improved and saved lives—will happen when machine learning automation and insights are properly coupled to the complex systems of industrial data.

Natalino Busa is the chief data architect at DBS, where he leads the definition, design, and implementation of big, fast data solutions for data-driven applications, such as predictive analytics, personalized marketing, and security event monitoring. Natalino is an all-around technology manager, product developer, and innovator with a 15+-year track record in research, development, and management of distributed architectures and scalable services and applications. Previously, he was the head of data science at Teradata, an enterprise data architect at ING, and a senior researcher at Philips Research Laboratories on the topics of system-on-a-chip architectures, distributed computing, and parallelizing compilers.

Presentations

Data production pipelines: Legacy, practices, and innovation Session

Modern data engineering requires machine learning engineers to implement and monitor ETL pipelines and machine learning models in production. Natalino Busa shares technologies, techniques, and blueprints for robustly and reliably managing data science and ETL flows from inception to production.

Cupid Chan is a managing partner at 4C Decision, where he helps clients ranging from Fortune 500 companies to the public sector leverage the power of data, analytics, and technology to gain invaluable insights that improve various aspects of their businesses. Previously, he was one of the key players in the construction of a world-class BI platform. A bilingual seasoned professional, Cupid holds various technical and business accreditations, such as PMP and Lean Six Sigma.

Presentations

Big data on the rise: Views of emerging trends and predictions from real-life end users Session

John Mertic and Cupid Chan share real end-user perspectives from companies like GE on how they are using big data tools, challenges they face, and where they are looking to focus investments—all from a vendor-neutral viewpoint.

Wei Chen is a software engineer at Intel. He is dedicated to performance optimization and simulation of storage engines for big data. Wei holds a master’s degree in signal and information processing from Nanjing University in China.

Presentations

Best practices with Kudu: An end-to-end use case from the automobile industry Session

Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.

Jessica Chen Riolfi is the head of Asia at TransferWise, where she accelerates TransferWise’s mission of money without borders throughout Asia. Jessica has spent her career building global products. Previously, she led growth at eBay, where she focused on reaching new customers in China, Brazil, and Mexico, and helped launch the eBay store on WeChat. Jessica holds an MBA from Harvard Business School and a BA from Dartmouth College.

Presentations

Executive Briefing: The data-driven growth engine Session

Data is essential to unlock growth opportunities, and successful companies use it in every decision. Jessica Chen Riolfi explains how to build an organization with decentralized, data-driven decision making that enables teams to focus on the products and features that matter and ultimately unlock exponential growth.

Cheng Feng is a data engineer at Grab, where he works on the big data platform, distributed computing, stream processing, and data science. Previously, he was a data scientist at the Lazada Group, working on Lazada’s tracker, customer segmentation and recommendation systems, and fraud detection.

Presentations

Operationalizing Presto in the cloud: Lessons and mistakes Session

Grab uses Presto to support operational reporting (batch and near real-time), ad hoc analyses, and its data pipeline. Currently, Grab has 5+ clusters with 100+ instances in production on AWS and serves up to 30K queries per day while supporting more than 200 internal data users. Cheng Feng and Yanyu Qu explain how Grab operationalizes Presto in the cloud and share lessons learned along the way.

Anand Chitipothu is a platform architect at rorodata, a data science platform that he recently cofounded. Anand has been crafting beautiful software for more than a decade. Previously, he worked at Strand Life Sciences and Internet Archive. Anand regularly conducts advanced programming courses through Pipal Academy and is the coauthor of web.py, a micro web framework in Python.

Presentations

Managing machine learning models in production Session

There are many challenges to deploying machine learning models in production, including managing multiple versions of models, maintaining staging and production models, keeping track of model performance, logging, and scaling. Anand Chitipothu explores the tools, techniques, and system architecture of a cloud platform built to solve these challenges and the new opportunities it opens up.

Based in Singapore, Sachin works closely with leading telecom and technology companies across Southeast Asia, Hong Kong, and Australia. He has worked extensively in both the B2C and B2B spaces, setting up and scaling new digital businesses and delivering large-scale commercial turnaround programs. Sachin brings deep expertise in digital marketing and sales, with a specific focus on channel digitization, advanced analytics, and pricing. His experience spans the full journey from crafting digital strategy to driving end-to-end technical implementation, and he has personally helped some of the firm’s largest TMT clients identify key digital opportunities and pursue them at scale.

A leading expert on digital disruption, particularly in commercial operations, Sachin has helped shape the firm’s thinking on how to enable digital transformation at scale in Asian markets, where the starting point and the challenges are very different from other markets. Given his prior experience in mid-market investing, he also brings a perspective on how to approach these new digital opportunities with a venture capital mindset.

Presentations

Executive Briefing: Artificial intelligence—The next digital frontier? Session

After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Shilpa Aggarwal explains where investment is going, patterns of AI adoption, and how the value potential of AI across sectors and business functions is beginning to emerge in Asia.

Victor Chua is a senior data scientist and visualizer on the innovations team at SmartHub (part of StarHub Ltd.), where he is responsible for building and delivering unique telco analytics products through big data technologies. Victor has a strong passion for data analytics and 3D graphics. He holds a master’s degree in information systems management from Carnegie Mellon University.

Presentations

Analyzing smart cities and big data in 3D: A Geo3D journey at SmartHub Smart Cities

The rise of densely populated, highly built-up smart cities around the globe has stretched the capabilities of current 2D visualization techniques. With the advent of drones, IoT devices, and indoor geolocation, next-gen 3D visualizations are beginning to address this challenge. Victor Chua explores how SmartHub is gearing up for a 3D future to support cutting-edge data analytics.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Case studies in data ethics DCS

From predictive algorithms, to markets that fight back, to the legal and moral questions of automation, Strata chair and Lean Analytics author Alistair Croll looks at examples of ethical issues from across a wide range of case studies in the private and public sectors, inspired by his role as a visiting executive at Harvard Business School.

Data Case Studies welcome Tutorial

Program chair Alistair Croll welcomes you to the Data Case Studies tutorial.

Smart Cities welcome Tutorial

Program chair Alistair Croll welcomes you to the first day of the Smart Cities tutorial.

The dumb consequences of smart cities Session

We infuse urban spaces with sensors, drinking from a torrent of data, making sense of city life. But this reliance on data has real risks: Complex systems often have unintended consequences, and it's hard to experiment. Alistair Croll shares lessons from the past and explains how paving the cowpaths, examining the models, and iterating everything can mitigate these risks.

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Amit Das is the cofounder and CEO of Think Analytics India, where he has conceptualized many fintech enabler analytics solutions, including Algo360, an alternate data solution. Amit fell in love with analytics while working at Tata Consultancy Services. Over his career, he has worked for a number of successful companies, including Inductis (acquired by ExL Services) and Diamond Consultants (acquired by PwC LLP USA). Amit also led analytics delivery for PwC USA and was an EVP at 3i Infotech Limited, where he set up analytics as a capability and built smarter software products for banking and financial services. Amit’s love for data led him to build the foundations for an emerging market consumer dataset through Vito, a cutting-edge alternate data solution for the Indian market that brings down the cost of underwriting by over 40%. Amit holds a master’s degree in management from the Indian Institute of Management Bangalore and an undergraduate degree in economics from the University of Delhi.

Presentations

Driving financial inclusion in emerging markets using alternate data and big data analytics Session

Access to credit in emerging markets is impeded by issues around identity verification, risk assessment and monitoring, and the costs of underwriting and collections. At the core of it all is a lack of data. Amit Das explains how accessing alternate data, real-time risk monitoring and data access solutions, and smart analytics is changing the lending landscape in India.

Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.

Presentations

Privacy by design, not an afterthought: Best practices at LinkedIn Session

LinkedIn houses the most valuable professional data in the world. Protecting the privacy of member data has always been paramount. Shirshanka Das and Tushar Shanbhag outline three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement framework, and a unified data access layer.

Danielle Dean is a principal data scientist lead at Microsoft in the Algorithms and Data Science Group within the Artificial Intelligence and Research Division, where she leads a team of data scientists and engineers building predictive analytics and machine learning solutions with external companies utilizing Microsoft’s Cloud AI Platform. Previously, she was a data scientist at Nokia, where she produced business value and insights from big data through data mining and statistical modeling on data-driven projects that impacted a range of businesses, products, and initiatives. Danielle holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill, where she studied the application of multilevel event history models to understand the timing and processes leading to events between dyads within social networks.

Presentations

Bootstrap custom image classification using transfer learning Session

Transfer learning enables you to use pretrained deep neural networks (e.g., AlexNet, ResNet, and Inception V3) and adapt them for custom image classification tasks. Danielle Dean and Wee Hyong Tok walk you through the basics of transfer learning and demonstrate how you can use the technique to bootstrap the building of custom image classifiers.

Training and scoring deep neural networks in the cloud Session

Deep neural networks are responsible for many advances in natural language processing, computer vision, speech recognition, and forecasting. Danielle Dean and Wee Hyong Tok illustrate how cloud computing has been leveraged for exploration, programmatic training, real-time scoring, and batch scoring of deep learning models for projects in healthcare, manufacturing, and utilities.

Cesar Delgado is the Siri platform architect at Apple. He has also worked on iTunes, iCloud, News, and Maps. Previously, Cesar worked at various startups around Silicon Valley. He has been involved in the Apache Hadoop community since 2008.

Presentations

Siri: The journey to consolidation Keynote

Twenty years ago, a company implored us to “think different” about personal computers. Today, Apple continues to live and breathe that legacy. It’s evident in the machine learning and analytics architectures that power many of the company’s most innovative applications. Cesar Delgado joins Mick Hollison to discuss how Apple is using its big data stack and expertise to solve non-data problems.

Presentations

Predicting hospital readmission among diabetes patients Session

Predicting readmission can create savings in money and time for various stakeholders, such as governments, hospitals, insurers, employers, and, most importantly, patients. Resource-intensive interventions can then be targeted to the patients at greatest risk.

Thomas W. Dinsmore is director of product marketing for Cloudera Data Science. Previously, he served as a knowledge expert on the strategic analytics team at the Boston Consulting Group; director of product management for Revolution Analytics; analytics solution architect at IBM Big Data Solutions; and a consultant at SAS, PricewaterhouseCoopers, and Oliver Wyman. Thomas has led or contributed to analytic solutions for more than five hundred clients across vertical markets and around the world, including AT&T, Banco Santander, Citibank, Dell, J.C. Penney, Monsanto, Morgan Stanley, Office Depot, Sony, Staples, United Health Group, UBS, and Vodafone. His international experience includes work for clients in the United States, Puerto Rico, Canada, Mexico, Venezuela, Brazil, Chile, the United Kingdom, Belgium, Spain, Italy, Turkey, Israel, Malaysia, and Singapore.

Presentations

Data science at team scale: Considerations for sharing, collaborating, and getting to production Session

Data science alone is easy. Data science with others, in the enterprise, on shared distributed systems, requires a bit more work. Thomas Dinsmore and Johnson Poh share common technology considerations and patterns for collaboration in large teams and best practices for moving machine learning into production at scale.

Wolff Dobson is a developer programs engineer at Google specializing in machine learning and games. Previously, he worked as a game developer, where his projects included writing AI for the NBA 2K series and helping design the Wii Motion Plus. Wolff holds a PhD in artificial intelligence from Northwestern University.

Presentations

TensorFlow: Open source machine learning Session

TensorFlow, the world's most popular machine learning framework, is fast, flexible, and production ready. Wolff Dobson, representing the Google Brain team, shares the latest developments in TensorFlow, including tensor processing units (TPUs), distributed training, new APIs and models, and mobile features. Join in to learn what's in store for TensorFlow and how ML can change your business.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

GDPR: Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steven Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Smart cities, the smart grid, the IoT, and big data Smart Cities

Smart cities and the electricity smart grid have become leading examples of the IoT, in which distributed sensors describe mission-critical behavior by generating billions of metrics daily. Mark Donsky and Syed Rafice show how smart utilities and cities rely on Hadoop to capture, analyze, and harness this data to increase safety, availability, and efficiency across the entire electricity grid.

Graham Dumpleton is a developer advocate for OpenShift at Red Hat. Graham is the author of mod_wsgi, a popular module for hosting Python web applications with the Apache HTTPD web server. He has a keen interest in Docker and platform-as-a-service (PaaS) technologies. Graham is a fellow of the Python Software Foundation and an emeritus member of the Apache Software Foundation.

Presentations

Deploying a scalable JupyterHub environment for running Jupyter notebooks Session

Jupyter notebooks provide a rich interactive environment for working with data. Running a single notebook is easy, but what if you need to provide a platform for many users at the same time? Graham Dumpleton demonstrates how to use JupyterHub to run a highly scalable environment for hosting Jupyter notebooks in education and business.

Joey Echeverria is the director of engineering at Rocana, where he builds applications for scaling IT operations built on the Apache Hadoop platform. Joey is a committer on the Kite SDK, an Apache-licensed data API for the Hadoop ecosystem. Previously, he was a software engineer at Cloudera, where he contributed to several ASF projects, including Apache Flume, Apache Sqoop, Apache Hadoop, and Apache HBase. Joey is also a coauthor of Hadoop Security, published by O’Reilly.

Presentations

Debugging Apache Spark Session

Apache Spark offers greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark, and more.

Bruno Fernandez-Ruiz is cofounder and CTO at Nexar, where he and his team are using large-scale machine learning and machine vision to capture and analyze millions of sensor and camera readings in order to make our roads safer. Previously, Bruno was a senior fellow at Yahoo, where he oversaw the development and delivery of Yahoo’s personalization, ad targeting, and native advertising teams; his prior roles at Yahoo included chief architect for Yahoo’s cloud and platform and chief architect for international. Prior to joining Yahoo, Bruno founded OneSoup (acquired by Synchronica and now part of the Myriad Group) and YamiGo; was an enterprise architect for Fidelity Investments; and served as a manager in Accenture’s Center for Strategic Research Group, where he cofounded Meridea Financial Services and Accenture’s claim software solutions group. Bruno holds an MSc in operations research and transportation science from MIT, with a focus on intelligent transportation systems.

Presentations

Pascale Fung is a professor in the Department of Electronic & Computer Engineering at the Hong Kong University of Science & Technology. She is an elected fellow of the Institute of Electrical and Electronic Engineers (IEEE) for her contributions to human-machine interactions and an elected fellow of the International Speech Communication Association for fundamental contributions to the interdisciplinary area of spoken language human-machine interactions. She is keenly interested in promoting AI research for the betterment of humanity, including AI for ethical fintech and medical practices. Pascale has recently become a partner in the Partnership on AI, an organization of top AI players in industry and academia focused on promoting AI to benefit people and society. She is a member of the Global Future Council on Artificial Intelligence and Robotics, a think tank of the World Economic Forum, and blogs for the forum’s online publication Agenda. Pascale was recognized as one of 2017’s Outstanding Women Professionals and a Woman of Hope in 2014. She holds a PhD in computer science from Columbia University and is a fluent speaker of seven European and Asian languages.

Presentations

Graham Gear is director of system engineering at Cloudera and an Apache Hadoop committer. Having eagerly read the Google papers that inspired Hadoop and watched the community coalesce, Graham saw the huge potential of the Hadoop ecosystem early and has been contributing to it and helping organizations take advantage of it for many years. Previously, Graham delivered large-scale distributed systems with a keen analytical focus; he began his career implementing sonar algorithms leveraging MPI on large Beowulf clusters at a defense research institution.

Presentations

Real-world patterns for continuously deployed advanced analytics Session

How can we drive more data pipelines, advanced analytics, and machine learning models into production? How can we do this both faster and more reliably? Graham Gear draws on real-world processes and systems to explain how it's possible to apply continuous delivery techniques to advanced analytics, realizing business value earlier and more safely.

Bas Geerdink is a programmer, scientist, and IT manager at ING, where he is responsible for the fast data systems that process and analyze streaming data. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Analytics at ING: Technology solutions to create a real-time, data-driven bank Session

Bas Geerdink explains why and how ING is becoming more and more data-driven, sharing use cases, architecture, and technology choices along the way.

Adam Gibson is the CTO and cofounder of Skymind, a deep learning startup focused on enterprise solutions in banking and telco, and the coauthor of Deep Learning: A Practitioner’s Approach.

Presentations

Unsupervised fuzzy labeling using deep learning to improve anomaly detection Session

Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatically labeling (and the pitfalls of this), and discover how you can deploy these techniques in your organization.

Gaurav Godhwani is the technical lead for the Open Budgets India initiative, in association with CBGA. This initiative aims to promote greater transparency, accountability, and public participation in budget processes by making India’s budgets open, usable, and easy to comprehend. Gaurav is also one of the chapter leaders for DataKind Bangalore, where he is building a team of pro bono data scientists to help nonprofits tackle projects addressing critical humanitarian problems.

Presentations

Open Budgets India: Lessons from the front line Session

Most of India’s budget documents aren’t easily accessible. Those published online are mostly available as unstructured PDFs, making it difficult to search, analyze, and use this crucial data. Gaurav Godhwani discusses the process of creating Open Budgets India and making India’s budgets open, usable, and easy to comprehend.

Ajey Gore is group CTO at GO-JEK, where he helps the company deliver a transport, logistics, lifestyle, and payments platform comprising 18 products. Ajey has 18 years of experience building core technology strategy across diverse domains. His interests include machine learning, networking, and scaling products. Previously, Ajey founded CodeIgnition (acquired by GO-JEK) and served as ThoughtWorks’s head of technology. An active influencer in the technology community, Ajey organizes conferences, including RubyConf, GopherCon, and devopsdays, through his not-for-profit organization.

Presentations

Impacting a nation Keynote

Drawing on his experience at GO-JEK, Ajey Gore explains how the impossible can be made possible with technology and data insights.

Ishmeet Grewal is a senior research analyst at Accenture Labs, where he is the lead developer responsible for developing and prototyping a comprehensive strategy for automated analytics at scale. Ishmeet has traveled to 25 countries and likes to climb rocks in his free time.

Presentations

DevOps for models: How to manage millions of models in production—and at the edge Session

As Accenture scaled to millions of predictive models, it required automation to ensure accuracy, prevent false alarms, and preserve trust. Teresa Tung, Ishmeet Grewal, and Jurgen Weichenberger explain how Accenture implemented a DevOps process for analytical models that's akin to software development—guaranteeing analytics modeling at scale and even in noncloud environments at the edge.

Arwen Griffioen is a data scientist at Zendesk, where she works on the team producing deep learning solutions for customer self-service. An Oregonian expat who has lived in Melbourne for the past seven years, Arwen is passionate about improving the status of underrepresented groups in STEM fields and applying machine learning to make the world a little bit better. She holds a PhD in machine learning with a minor in ecoinformatics.

Presentations

Aha moments in deep learning at Zendesk Session

Chris Hausler and Arwen Griffioen discuss Zendesk's experience with deep learning, using the example of Answer Bot, a question-answering system that resolves support tickets without agent intervention. They cover the benefits Zendesk has already seen and challenges encountered along the way.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him on Twitter.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo walks you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng takes you from a conceptual overview all the way to building complex classifiers and explains how you can apply deep learning to complex problems in science and industry.

TensorFlow wide and deep: Data classification the easy way Session

Yufeng Guo demonstrates how to use TensorFlow to easily combine linear regression models and deep neural networks into a machine learning model that has the benefits of both. You'll also learn what is happening under the hood and how you can use this model for your own datasets.

Andreas Hadimulyono is a data warehouse engineer at Grab, where he ensures uninterrupted, error-free uptime while meeting the SLA requirements of business intelligence, analytics, and data science workloads. Previously, Andreas worked for Human Longevity Singapore, where he was responsible for the data pipeline for phenotypic data, which is used for genotype and phenotypes association studies.

Presentations

Streaming analytics at Grab Session

Andreas Hadimulyono discusses the challenges that Grab is facing with the ever-increasing volume and velocity of its data and shares the company's plans to overcome them.

Luke (Qing) Han is the cofounder and CEO of Kyligence, which provides a leading intelligent data platform powered by Apache Kylin to simplify big data analytics from on-premises to the cloud. Luke is the cocreator and PMC chair of Apache Kylin, where he contributes his passion to driving the project’s strategy, roadmap, and product design. For the past few years, Luke has been working on growing Apache Kylin’s community, building its ecosystem, and extending its adoption globally. Previously, he was big data product lead at eBay, where he managed Apache Kylin, engaged customers, and coordinated various teams from different geographical locations, and chief consultant at Actuate China.

Presentations

Apache Kylin: Advanced tuning and best practices with KyBot Session

Apache Kylin is an extreme distributed OLAP engine on Hadoop. Well-tuned cubes deliver the best performance at the least cost but require a comprehensive understanding of tuning principles. Dong Li and Luke Han explain advanced tuning and introduce KyBot, which applies AI methods to log analysis to find and solve bottlenecks intelligently.

Chris Hausler leads the data science team at Zendesk, a role he describes as turning lots of data into magic, which he does with the help of machine learning, Python, Hadoop, graphs galore, and amazing colleagues. Over his career, he’s held the titles of data scientist, data engineer, researcher, PhD student, consultant, and programmer.

Presentations

Aha moments in deep learning at Zendesk Session

Chris Hausler and Arwen Griffioen discuss Zendesk's experience with deep learning, using the example of Answer Bot, a question-answering system that resolves support tickets without agent intervention. They cover the benefits Zendesk has already seen and challenges encountered along the way.

Presentations

From physical data collection to digital delivery of results: The data journey in developing economies DCS

Alex Chade shares how Dotz used a coalition loyalty program to successfully collect transactional data (down to the SKU level) from its tens of millions of members across a number of segments, including grocery, gas, pharma, apparel, electronics, CPGs, insurance, and credit cards, and did so mostly in the physical world.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can learn more from his videos, blog posts, and conference presentations.

Presentations

Painless real-time scalable serverless data pipelines: What Google Cloud can do for you (sponsored by Google Cloud) Session

Stop worrying about infrastructure; focus on your data and insights. Felipe Hoffa explains how Google Cloud brings easy solutions to previously hard problems.

Stop the fights; embrace data (sponsored by Google) Keynote

Organizations waste hours on endless discussions, and people lose sleep over internet debates. Can big data change this? Google Cloud is here to help. Felipe Hoffa explains that solid data-based conclusions are possible when stakeholders have easy access to analyze all relevant data.

Mick Hollison is chief marketing officer at Cloudera, where he leads the company’s worldwide marketing efforts, including advertising, brand, communications, demand, partner, solutions, and web. Mick has had a successful 25-year career in enterprise and cloud software. Previously, he was CMO of sales acceleration at machine learning company InsideSales.com, where, under his leadership, the company pioneered a shift to data-driven marketing and sales that has served as a model for organizations around the globe; was global vice president of marketing and strategy at Citrix, where he led the company’s push into the high-growth desktop virtualization market; managed executive marketing at Microsoft; and held numerous leadership positions at IBM Software. Mick is an advisory board member for InsideSales and a contributing author to Inc.com. He is also an accomplished public speaker who has shared his insightful messages about the business impact of technology with audiences around the world. Mick holds a bachelor of science in management from the Georgia Institute of Technology.

Presentations

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it Session

Mick Hollison shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production—the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands—and outlines proven ways to meet them.

Siri: The journey to consolidation Keynote

Twenty years ago, a company implored us to “think different” about personal computers. Today, Apple continues to live and breathe that legacy. It’s evident in the machine learning and analytics architectures that power many of the company’s most innovative applications. Cesar Delgado joins Mick Hollison to discuss how Apple is using its big data stack and expertise to solve non-data problems.

Yiqun Hu is the head of data for SP Digital, where he is responsible for leading the data team on the development of machine learning capabilities for energy and utility applications. Previously, he helped several organizations build data-driven products such as image recognition systems and recommendation engines. Yiqun is the author of 30+ scientific publications in the machine learning area, with over 1,400 citations. He holds a PhD from Nanyang Technological University and a bachelor’s degree in computer science from Xiamen University.

Presentations

Energy monitoring with a self-taught deep network Session

Energy usage is a significant part of daily life, so the ability to monitor this use offers a number of benefits, from cost savings to improved safety. A key challenge is the lack of labeled data. Yiqun Hu shares a new solution: an RNN-based network trained to learn good features from unlabeled data.

Mark Hunter is chief data officer at Sainsbury’s Bank. Previously, Mark was head of analytics and digital products at Coles Financial Services, where he worked across Beijing, Hong Kong, and Melbourne. He has served as deputy chair of ISAC, an analytics industry association in Australia.

Presentations

Organizing for machine learning success Session

Deploying machine learning in business requires far more than just selecting an algorithm. You need the right architecture, tools, and team organization to drive your agenda successfully. John Akred and Mark Hunter share practical advice on the technical and human sides of machine learning, based on experience preparing Sainsbury’s for its ML-enabled future.

Masatake Iwasaki is a software engineer at NTT DATA, where he works on OSS professional services, including consulting, system integration, and technical support of open source software such as Hadoop, Spark, and Storm for enterprise systems. He is also a committer for Apache Hadoop and Apache HTrace (incubating).

Presentations

Fusing a deep learning platform with a big data platform Session

SmartHub and NTT DATA have embarked on a partnership to design next-generation architecture to power the data products that will help generate new insights. YongLiang Xu and Masatake Iwasaki explain how deep learning and other analytics models can coexist on the same platform to address opportunities and challenges in initiatives such as smart cities.

Vickye Jain is a technology manager at ZS Associates, where he jointly runs the big data expertise center. Vickye has extensive experience implementing large-scale big data platforms for Fortune 200 companies in the US. He and his team have implemented very large-scale ETL offloading use cases, data lakes, and high-performance data processing platforms that have had a transformative business impact on commercial, R&D, and operations organizations within life sciences.

Presentations

High-performance enterprise data processing with Spark Session

Vickye Jain and Raghav Sharma explain how they built a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.

Yousun Jeong is an IT manager at SK Telecom (SKT), South Korea’s largest wireless communications provider, where she focuses on big data analysis.

Presentations

Big telco real-time network analytics Session

Data transfer is one of the most pressing problems for telecom companies, as costs increase in tandem with growing data requirements. Yousun Jeong details how SKT has dealt with this problem.

Calvin Jia is the release manager for Alluxio and is a core maintainer of the project. He is also the top contributor to the Alluxio project and one of its earliest contributors. Calvin holds a BS from the University of California, Berkeley.

Presentations

Decoupling compute and storage with open source Alluxio Session

Calvin Jia and Haoyuan Li explain how to decouple compute and storage with Alluxio, exploring the decision factors, considerations, and production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform.

Xianyan Jia is a software engineer at Intel, where she’s responsible for developing deep learning and machine learning algorithms and pipelines. She is also a contributor to BigDL, a distributed deep learning framework on Apache Spark.

Presentations

Bringing deep learning into big data analytics using BigDL Session

Xianyan Jia and Zhenhua Wang explore deep learning applications built successfully with BigDL. They also teach you how to develop fast prototypes with BigDL's off-the-shelf deep learning toolkit and build end-to-end deep learning applications with flexibility and scalability using BigDL on Spark.

Melanie Johnston-Hollitt is an internationally prominent radio astronomer, the director of astronomy and astrophysics at Victoria University of Wellington, and CEO of Peripety Scientific Ltd., an astrophysics and data analytics research company based in Wellington, New Zealand. Melanie serves as chair of the board of the $60 million Murchison Widefield Array (MWA) radio telescope and is a founding member of the board of directors of the Square Kilometre Array (SKA) Organisation Ltd., which is tasked with building the world’s largest radio telescope. In her nearly 20-year career, she has been involved in the design, construction, and operation of several major radio telescopes, including the Low Frequency Array in the Netherlands, the MWA in Australia, and the SKA, which will be hosted in both Australia and South Africa. These instruments produce massive quantities of data, requiring new and disruptive technologies to allow value to be extracted from the data deluge. As a result, Melanie’s recent interests span the intersection between radio astronomy, signal processing, and big data analytics. She leads a multidisciplinary team in Wellington that is investigating how best to meet the science challenges of these next-generation instruments in the big data era.

Presentations

Computational challenges and opportunities of astronomical big data Keynote

Keynote with Melanie Johnston-Hollitt

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. At narrativeVIZ Consulting, Amit uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Amit also teaches storytelling with data as guest faculty in executive courses at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 12 years of management consulting experience with AT Kearney in India, Booz & Company in Europe, and more recently for startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad. Find more about him at Amitkaps.com.

Presentations

Interactive visualization for data science Tutorial

One of the challenges of traditional data visualizations is that they are static and bounded by limited physical pixel space. Interactive visualizations allow us to move beyond this limitation by adding layers of interaction. Bargava Subramanian and Amit Kapoor teach the art and science of creating interactive data visualizations.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.

Presentations

Debugging Apache Spark Session

Apache Spark offers greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark, and more.

Extending Spark ML: Adding custom pipeline stages to Spark Session

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau introduces Spark’s ML pipelines and explains how to extend them with your own custom algorithms, allowing you to take advantage of Spark's meta-algorithms and existing ML tools.

Markus Kirchberg is CEO of Wismut Labs Pte. Ltd., where he leads a team of diverse technical experts that help modernize and transform clients’ products, services, and operations. Markus has over 20 years of experience in research and technology-driven innovation, and his career spans academia, dedicated research centers, and industrial research and incubation labs. Previously, Markus was the head of technology innovation at Deep Labs, where he was responsible for driving and delivering technology innovation across the Asia Pacific region; headed Visa Labs, Asia Pacific; served as an expert at HP Labs Singapore, where he led various innovation initiatives on next-generation, cross-domain data analytics platforms; worked as a research fellow and principal investigator at the Institute for Infocomm Research at A*STAR; and was a lecturer at Massey University, New Zealand. Markus’s skill set includes full innovation lifecycle management, automating infrastructure, cloud computing, data management at multipetabyte scale, data privacy, deep learning, emerging technologies, the internet of things, large-scale data analytics, and extreme transaction processing. He has extensive experience in healthcare, logistics, payments analytics and processing, risk management, and transportation.

Presentations

Payment fraud detection and prevention in the age of big data, network science, and AI Session

As the share of digital payments increases, so does payment fraud, which almost tripled between 2013 and 2016. Markus Kirchberg explains how recent advances in AI and machine learning, decision sciences, and network sciences are driving the development of next-generation payment fraud capabilities for fraud scoring, deceptive merchant detection, and merchant compromise detection.

Mike Koelemay runs the data science team within advanced analytics at Sikorsky, where he is responsible for bringing state-of-the-art analytics and algorithm technologies to support the ingestion, processing, and serving of data collected onboard thousands of aerospace assets around the world. Drawing on his 10+ years of experience in applied data analytics for integrated system health management technologies, Mike works with other software engineers, data architects, and data scientists to support the execution of advanced algorithms, data mining, signal processing, system optimization, and advanced diagnostics and prognostics technologies, with a focus on rapidly generating information from large, complex datasets.

Presentations

Where data science meets rocket science: Data platforms and predictive analytics for aerospace DCS

Sikorsky collects data onboard thousands of helicopters deployed worldwide that is used for fleet management services, engineering analyses, and business intelligence. Mike Koelemay offers an overview of the data platform that Sikorsky has built to manage the ingestion, processing, and serving of this data so that it can be used to rapidly generate information to drive decision making.

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. He specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. In addition to his client-facing consulting and training, Jared is an adjunct professor of statistics at Columbia University and the organizer of the New York Open Statistical Programming Meetup and the New York R Conference. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world and was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master’s degree in statistics from Columbia University and a bachelor’s degree in mathematics from Muhlenberg College.

Presentations

Machine learning in R Tutorial

Modern statistics has become almost synonymous with machine learning—a collection of techniques that utilize today's incredible computing power. Jared Lander walks you through the available methods for implementing machine learning algorithms in R and explores underlying theories such as the elastic net, boosted trees, and cross-validation.

Making R go faster and bigger Session

One common (but false) knock against R is that it doesn't scale well. Jared Lander shows how to use R in a performant manner, in terms of both speed and data size, and offers an overview of packages for running R at scale.

Philip Langdale is the engineering lead for cloud at Cloudera. He joined the company as one of the first engineers building Cloudera Manager and served as an engineering lead for that project until moving to working on cloud products. Previously, Philip worked at VMware, developing various desktop virtualization technologies. Philip holds a bachelor’s degree with honors in electrical engineering from the University of Texas at Austin.

Presentations

A deep dive into running big data workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

How to successfully run data pipelines in the cloud Session

With its scalable data store, elastic compute, and pay-as-you-go cost model, cloud infrastructure is well-suited for large-scale data engineering workloads. Kostas Sakellis explores the latest cloud technologies, focusing on data engineering workloads, cost, security, and ease-of-use implications for data engineers.

Tony Lee is CISO of JD.com, China’s version of Amazon. Previously, Tony was the chief architect for cloud security at Baidu; the founder and CTO of AnQuanBao, a company providing SaaS-based website protection (acquired by Baidu); the executive committee chair of the IEEE Industry Connection Group; and a senior software development manager at Microsoft. He has 10 years of industry experience in enterprise and cloud software products, including Microsoft Forefront, Azure, and Windows. Tony holds a BS in computer science and electrical engineering from the University of California, Berkeley, and an MS from the University of California, Los Angeles, in the field of data mining. He has two US patents, in intelligent analysis of software and behavior-based data mining, respectively, with four more patent applications filed and in progress.

Steve Leonard is the founding CEO of SGInnovate, a private limited company wholly owned by the Singapore government. Capitalizing on the science and technology research for which Singapore has gained a global reputation, Steve’s team works with local and international partners, including universities, venture capitalists, and major corporations, to help technical founders imagine, start, and scale globally relevant early-stage technology companies from Singapore. A technology-industry leader with a wide range of experience, Steve has played a key role in building several global companies in areas such as software, hardware, and services. Previously, he was the executive deputy chairman of the Infocomm Development Authority (IDA), a government statutory board under the purview of Singapore’s Ministry of Communications and Information, where he was responsible for various aspects of the information technology and telecommunications industries in Singapore on a national level. Steve serves on the advisory boards of a number of universities and organizations in Singapore and is an independent non-executive director of AsiaSat, a Hong Kong Stock Exchange-listed commercial operator of communication spacecraft. Although born in the US, Steve considers himself a member of the larger global community, having lived and worked outside the US for more than 25 years.

Presentations

Technology for humanity Keynote

Steve Leonard details how Singapore is bringing together ambitious and capable individuals and teams to imagine, start, build, and scale technology that can solve the world’s toughest challenges.

Dong Li is a technical partner and senior software architect at Kyligence. Dong is also an Apache Kylin committer and PMC member and the tech lead for KyBot. Previously, Dong was a senior software engineer in the Analytics Data Infrastructure Department at eBay and a software development engineer in the Cloud and Enterprise Department at Microsoft, where he was a core member of the Dynamics APAC team, responsible for developing next-generation cloud-based ERP solutions. Dong holds both a bachelor’s and a master’s degree from Shanghai Jiao Tong University.

Presentations

Apache Kylin: Advanced tuning and best practices with KyBot Session

Apache Kylin is an extreme distributed OLAP engine on Hadoop. Well-tuned cubes deliver the best performance at the least cost but require a comprehensive understanding of tuning principles. Dong Li and Luke Han explain advanced tuning and introduce KyBot, which finds and resolves bottlenecks intelligently by applying AI methods to log analysis results.

Haoyuan Li is founder and CEO of Alluxio (formerly Tachyon Nexus), a memory-speed virtual distributed storage system. Before founding the company, Haoyuan was working on his PhD at UC Berkeley’s AMPLab, where he cocreated Alluxio. He is also a founding committer of Apache Spark. Previously, he worked at Conviva and Google. Haoyuan holds an MS from Cornell University and a BS from Peking University.

Presentations

Decoupling compute and storage with open source Alluxio Session

Calvin Jia and Haoyuan Li explain how to decouple compute and storage with Alluxio, exploring the decision factors, considerations, and production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform.

Simon Lidberg is a solution architect within Microsoft’s Data Insights Center of Excellence. He has worked with database and data warehousing solutions for almost 20 years in a variety of industries and has more recently focused on analysis, BI, and big data. Simon is the author of Getting Started with SQL Server 2012 Cube Development.

Presentations

The value of a data science center of excellence (COE) Session

As organizations turn to data-driven strategies, they are also increasingly exploring the creation of a data science or analytic center of excellence (COE). Benjamin Wright-Jones and Simon Lidberg outline the building blocks of a center of excellence and describe the value for organizations embarking on data-driven strategies.

Lim Ee Peng is a professor at Singapore Management University (SMU). His research interests include social media analytics, information integration, and information retrieval. At SMU, he is also the director of the Living Analytics Research Centre, an NRF-supported research centre focusing on data analytics research for smart nation application domains.

Presentations

Talent flow behavior analytics: A data-driven approach to human capital management Session

Analyzing talent flow behavior is important for understanding the job preferences and career progression of working individuals. When analyzed at the workforce population level, talent flow analytics yields insights into talent flow and organizational competition.

Yu-Xi Lim is lead data scientist at Teralytics, where he leads the technical team in the company’s Singapore office. Yu-Xi is interested in applying data science to retail and travel. Previously, he led teams at Southeast Asian ecommerce giant Lazada and at TravelShark; was vice president of engineering at payment startup Fastacash; and was a software engineer in Microsoft’s Windows Division. Yu-Xi holds a PhD in electrical and computer engineering from Georgia Tech, where he did research on WiFi positioning systems.

Presentations

Distributed real-time highly available stream processing Session

Yu-Xi Lim and Michal Wegrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. At its core, the pattern relies on functional reactive programming idioms to shard and splice state fragments, ensuring high horizontal scalability, reliability, and high availability.

Zhihao Lin started Teralytics’s Asia business and is currently the company’s head of Asia, where he is responsible for data science, software engineering, and business development. Previously, Zhihao worked for cybersecurity and advanced analytics companies in Switzerland, Singapore, and India. He holds an MS in computer science from Eidgenössische Technische Hochschule Zürich (ETH Zürich) and a BE in computer engineering (with first-class honors) from the National University of Singapore.

Presentations

Moving a smart nation: Using telco data for public transport Session

With a rapidly growing population and the pursuit of a car-lite Singapore and Hong Kong, there is a need to improve the efficiency and responsiveness of public transportation. Zhihao Lin explores how telco-enabled data analytics empower day-to-day operational effectiveness and forward planning of public transport networks in Singapore.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Responsible deployment of machine learning Keynote

Machine learning models are becoming increasingly widely used and deployed. Ben Lorica explains how to guard against flaws and failures in your machine learning deployments.

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and HearthStone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

More data more problems DCS

Details to come

Top five mistakes when writing streaming applications Session

Ted Malaska shares the top five mistakes that no one talks about when you start writing your streaming app, along with the practices you'll inevitably need to learn along the way.

Peng Meng is a senior software engineer on the big data and cloud team at Intel, where he focuses on Spark and MLlib optimization. Peng is interested in machine learning algorithm optimization and large-scale data processing. He holds a PhD from the University of Science and Technology of China.

Presentations

Apache Spark ML and MLlib tuning and optimization: A case study on boosting the performance of ALS by 60x Session

Apache Spark ML and MLlib are hugely popular in the big data ecosystem, and Intel has been deeply involved in Spark from a very early stage. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib ALS by 60x in JD.com’s production environment.

John Mertic is director of program management for ODPi and the Open Mainframe Project at the Linux Foundation. John comes from a PHP and open source background. Previously, he was director of business development software alliances at Bitnami, a developer, evangelist, and partnership leader at SugarCRM, board member at OW2, president of OpenSocial, and a frequent speaker at conferences around the world. As an avid writer, John has published articles on IBM Developerworks, Apple Developer Connection, and PHP Architect and authored The Definitive Guide to SugarCRM: Better Business Applications and Building on SugarCRM.

Presentations

Big data on the rise: Views of emerging trends and predictions from real-life end users Session

John Mertic and Cupid Chan share real end-user perspectives from companies like GE on how they are using big data tools, challenges they face, and where they are looking to focus investments—all from a vendor-neutral viewpoint.

Harjinder Mistry is a principal research engineer at Ola, where he is building a cloud-native data-science platform to solve challenging problems of fleet management. Previously, he engineered data platforms for a couple of interesting data-science projects: OpenShift.io at Red Hat and the Watson ML platform at IBM. Earlier, he spent several years in the DB2 SQL Query Optimizer team, building and fixing the mathematical model that decides the query execution plan. Harjinder holds an MTech from IIIT, Bangalore, India.

Presentations

A recommendation system for wide transactions Session

Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving it using Elasticsearch in the cloud.

Engineering cloud-native machine learning applications Session

In the current Agile business environment, where developers are required to experiment multiple ideas and also react to various situations, doing cloud-native development is the way to go. Harjinder Mistry and Bargava Subramanian explain how to design and build a microservices-based cloud-native machine learning application.

Prateek Nagaria is a data scientist for the Data Team. Prateek is an advanced analytics expert with more than five years of experience. He specializes in business analytics, big data technologies, and statistical modeling, as well as programming languages such as R, Python, Java, C, and C++. Prateek holds a master’s degree in enterprise business analytics from the National University of Singapore and a bachelor’s degree in computer science and engineering.

Presentations

Forecasting intermittent demand: Traditional smoothing approaches versus the Croston method Session

Most data scientists use traditional methods of forecasting, such as exponential smoothing or ARIMA, to forecast product demand. However, when a product experiences several periods of zero demand, approaches such as Croston may provide better accuracy than these traditional methods. Prateek Nagaria compares traditional and Croston methods in R on intermittent demand time series.

Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

AI within O'Reilly Media Session

Paco Nathan explains how O'Reilly employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video.

Human-in-the-loop: a design pattern for managing teams that leverage ML Session

Human-in-the-loop is an approach that has been used for simulation, training, UX mockups, and more. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated ML-based processes, where exceptions are referred to human experts.

Kyungtaak Noh is a software manager at SK Telecom. A software architect and full stack web application developer, Kyungtaak has been working in the area of visualizing and coordinating with big data applications.

Presentations

Big data: The best way to truly understand customers in Telco DCS

In the telecommunication industry, quality of service in networks is the customer’s top concern, but it is difficult to analyze due to the increasingly massive volume of data. Kyungtaak Noh and Jisung Kim offer their solution—a quality management system that integrates Hadoop and big data technology—and explain how they use it to efficiently visualize and utilize big data.

Supreet Oberoi is vice president of engineering, IoT, and big data at Oracle. A technology executive with over 20 years of experience building products and solutions for real-time, distributed, and big data analytical applications, Supreet has held technical and leadership roles at Oracle, Concurrent, American Express, Real-Time Innovations, Agile, Microsoft, and many other privately held Silicon Valley companies. He is also the lead mentor for Stanford student startup accelerator startX. Supreet holds a BS in computer sciences with highest honors from the University of Texas at Austin and an MS in computer sciences from Stanford University. He is widely published and often presents at conferences. In his free time, Supreet is reconnecting with an old passion—painting.

Presentations

Querying time series patterns with SAX Session

Time series data is any dataset that is plotted over a range of time. Often, in IoT use cases, what is of interest is finding a pattern in the sequence of measurements. However, queries on time series data do not traditionally scale. Supreet Oberoi explains how Oracle adapted and extended symbolic aggregate approximation (SAX) to solve such challenges.

Jean-Baptiste Onofré is a fellow and software architect at cloud and big data integration software company Talend. An ASF member and contributor to roughly 20 different Apache projects, Jean-Baptiste specializes in both system integration and big data. He is also a champion and PPMC member for multiple Apache projects, including Apache Beam.

Presentations

How Apache Beam can advance your enterprise workloads Session

Apache Beam allows data pipelines to work in batch, streaming, and a variety of open source and private cloud data processing backends, including Apache Flink, Apache Spark, and Google Cloud Dataflow. Jean-Baptiste Onofré offers an overview of Apache Beam's programming model, explores mechanisms for efficiently building data pipelines, and demos an IoT use case dealing with MQTT messages.

Matteo is the head of data engineering at DBS Bank, overseeing the design and development of the entire DBS big data compute platform. Matteo has more than 15 years of experience in software engineering. In recent years, he has focused on scalable big data platforms and machine learning, specifically using Hadoop and Spark. Previously, Matteo held various roles in startups and multinational corporations, leading engineering teams at DataRobot, Bubbly, Microsoft, and Nokia.

Presentations

Data production pipelines: Legacy, practices, and innovation Session

Modern data engineering requires machine learning engineers to implement and monitor ETL and machine learning models in production. Natalino Busa shares technologies, techniques, and blueprints for robustly and reliably managing data science and ETL flows from inception to production.

Radha works as an enterprise data scientist at Thomson Reuters, applying machine learning and quantitative financial modeling techniques to large datasets to solve specific problems in the financial sector. Previously, he was a portfolio manager at Goldman Sachs Asset Management. He has more than a decade of experience building financial and statistical models. Radha holds a master’s degree in financial engineering from the City University of New York, a postgraduate degree in management from the Indian Institute of Management, Indore, and a BTech in civil engineering from the Indian Institute of Technology, Madras.

Presentations

Practical applications for graph techniques in supply chain analysis and finance Session

Graphical techniques are increasingly being used for big data. These techniques can be broadly classified into the three C's: centrality, clustering, and connectedness. Eric Tham explains how these concepts are applied to supply chain analysis and financial portfolio management.

Clifton Phua is a director at NCS Group, leading a team of data scientists working on artificial intelligence, machine learning, and advanced analytics under the Smart and Safe City Centre of Excellence. Previously, Clifton worked at the SAS Institute on SAS analytics, where he specialized in big data analytics in public security (attack and disaster preparation, recovery, and response, cybersecurity, internal security, and predictive policing) and fraud (government, banking, and insurance); was a data scientist in the Data Analytics Department (formerly known as the Data Mining Department) at the Institute for Infocomm Research (I2R) at the Agency for Science, Technology, and Research (A*STAR) in Singapore, where he focused on web monitoring of companies and technologies, assistive technology for people with dementia, and mobile phone activity recognition and worked on real-world energy-related analytics projects to improve parts of the smart grid and other big data applications; and served as a research fellow in the Data Mining Laboratory within the Department of Industrial Engineering at Seoul National University, South Korea. Clifton holds a PhD in identity crime detection and a bachelor’s degree with first-class honors, both from Monash University, Australia.

Presentations

Advanced analytics for a safe city Smart Cities

Clifton Phua offers an overview of several key applications of advanced analytics related to public safety, illustrating the potential value and insights that advanced analytics can bring to a safe city.

Johnson Poh heads the data science practice at DBS’s Big Data Analytics Center of Excellence, where he drives the development of core data science capabilities for enhancing decision analysis. He spent the past decade leading teams in applying statistical learning models across government, pharmaceutical, and financial industries. Johnson holds a postgraduate degree in statistical computing from Yale University and bachelor degrees in mathematics, statistics, and economics from UC Berkeley.

Presentations

Data science at team scale: Considerations for sharing, collaborating, and getting to production Session

Data science alone is easy. Data science with others, in the enterprise, on shared distributed systems, requires a bit more work. Thomas Dinsmore and Johnson Poh share common technology considerations and patterns for collaboration in large teams and best practices for moving machine learning into production at scale.

Philips is a principal engineer at the Living Analytics Research Centre at Singapore Management University. His research interests include social media mining, job analytics, and machine learning.

Presentations

Talent flow behavior analytics: A data-driven approach to human capital management Session

Analyzing talent flow behavior is important for understanding the job preferences and career progression of working individuals. When analyzed at the workforce population level, talent flow analytics yields insights into talent flow and organizational competition.

Michael Prorock is founder and CTO at mesur.io. Michael is an expert in systems and analytics, as well as in building teams that deliver results. Previously, he was director of emerging technologies for the Bardess Group, where he defined and implemented a technology strategy that enabled Bardess to scale its business to new verticals across a variety of clients, and worked in analytics for Raytheon, Cisco, and IBM, among others. He has filed multiple patents related to heuristics, media analysis, and speech recognition. In his spare time, Michael applies his findings and environmentally conscious methods on his small farm.

Presentations

Smart agriculture: Blending IoT sensor data with visual analytics on Apache Hive and Spark DCS

Mike Prorock and Hugo Sheng offer an overview of mesur.io, a game-changing climate awareness solution that utilizes Apache Hive, Spark, ESRI, and Qlik. Mesur.io combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market.

Xie Qi is a senior software engineer on the big data engineering team at Intel China, where he works on Spark optimization for Intel platforms. Xie has broad experience across big data, multimedia, and wireless.

Presentations

FPGA-based acceleration architecture for Spark SQL Session

Xie Qi and Quanfu Wang offer an overview of a configurable FPGA-based Spark SQL acceleration architecture that leverages FPGAs' very high parallel computing capability to tremendously accelerate Spark SQL queries and FPGAs' power efficiency to lower power consumption.

Yanyu Qu is a data engineer on Grab’s data engineering team, where he works on Spark and Presto’s data gateway. Previously, he worked at FunPlus, App Annie, IBM, and Teradata.

Presentations

Operationalizing Presto in the cloud: Lessons and mistakes Session

Grab uses Presto to support operational reporting (batch and near real-time), ad hoc analyses, and its data pipeline. Currently, Grab has 5+ clusters with 100+ instances in production on AWS and serves up to 30K queries per day while supporting more than 200 internal data users. Feng Cheng and Yanyu Qu explain how Grab operationalizes Presto in the cloud and share lessons learned along the way.

Kira Radinsky is the chief scientist and director of data science at eBay, where she is building the next-generation predictive data mining, deep learning, and natural language processing solutions that will transform ecommerce. She also serves as a visiting professor at the Technion, Israel’s leading science and technology institute, where she focuses on the application of predictive data mining in medicine. Kira cofounded SalesPredict (acquired by eBay in 2016), a leader in the field of predictive marketing whose solutions leveraged large-scale data mining to predict sales conversions. One of the up-and-coming voices in the data science community, Kira is pioneering the field of web dynamics and temporal information retrieval. She gained international recognition for her work at Microsoft Research, where she developed predictive algorithms that recognized the early-warning signs of globally impactful events, including political riots and disease epidemics. She was named one of MIT Technology Review’s 35 young innovators under 35 for 2013 and one of Forbes’s 30 under 30 rising stars in enterprise technology for 2015; in 2016, she was recognized as woman of the year by Globes. Kira is a frequent presenter at global tech events, including TEDx and the World Wide Web Conference, and she has published in Harvard Business Review.

Presentations

Mining electronic health records and the web for drug repurposing Keynote

Kira Radinsky offers an overview of a system that jointly mines 10 years of nation-wide medical records of more than 1.5 million people and extracts medical knowledge from Wikipedia to provide guidance about drug repurposing—the process of applying known drugs in new ways to treat diseases.

Syed Rafice is a senior system engineer at Cloudera, where he specializes in big data on Hadoop technologies and is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed also focuses on both platform and cybersecurity. He has worked across multiple sectors, including government, telecoms, media, utilities, financial services, and transport.

Presentations

Smart cities, the smart grid, the IoT, and big data Smart Cities

Smart cities and the electricity smart grid have become leading examples of the IoT, in which distributed sensors describe mission-critical behavior by generating billions of metrics daily. Mark Donsky and Syed Rafice show how smart utilities and cities rely on Hadoop to capture, analyze, and harness this data to increase safety, availability, and efficiency across the entire electricity grid.

Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over his 20-year career, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently, product management, to provide a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Rethinking data marts in the cloud: Common architectural patterns for analytics Session

Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.

Isaac Reyes is a principal at DataSeer, where he leads a team that delivers in-house training courses in data storytelling, predictive analytics, and machine learning to companies such as Cisco, Ericsson, Hewlett Packard, and Pfizer. A data scientist, trainer, and TEDx speaker who lives, breathes, and dreams data, Isaac previously lectured in statistical theory at the Australian National University and worked as a data scientist in the private sector.

Presentations

The art of data storytelling Session

Isaac Reyes explores the art and science of data storytelling, covering the essential elements of a good data story, chart design and why it matters, the Gestalt principles of visual perception and how they can be used to tell better stories with data, and how to make over a poor visualization.

Steve Ross is the director of product management at Cloudera, where he focuses on security across the big data ecosystem, balancing the interests of citizens, data scientists, and IT teams working to get the most out of their data while preserving privacy and complying with the demands of information security and regulations. Previously, at RSA Security and Voltage Security, Steve managed product portfolios now in use by the largest global companies and hundreds of millions of users.

Presentations

GDPR: Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steve Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Nikki Rouda is the cloud and core platform director at Cloudera. Nik has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their IT challenges. His career spans big data, analytics, machine learning, AI, storage, networking, security, and the IoT. Nik holds an MBA from Cambridge and an ScB in geophysics and math from Brown.

Presentations

Good everywhere: Managing security and governance in a hybrid- and multicloud world Session

Managing the security and governance of big data can be challenging on-premises but becomes far more difficult in a heterogeneous environment spanning a public cloud or across multiple cloud services. Nikki Rouda shares unbiased best practices to ensure your data is under control everywhere.

Kostas Sakellis is the lead and engineering manager of the Apache Spark team at Cloudera. Kostas holds a bachelor’s degree in computer science from the University of Waterloo, Canada.

Presentations

How to successfully run data pipelines in the cloud Session

With its scalable data store, elastic compute, and pay-as-you-go cost model, cloud infrastructure is well-suited for large-scale data engineering workloads. Kostas Sakellis explores the latest cloud technologies, focusing on data engineering workloads, cost, security, and ease-of-use implications for data engineers.

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists, particularly focusing on the Apache Spark ecosystem. Previously, he worked at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka. Neelesh holds a master’s degree in computer science with a focus on cloud computing from North Carolina State University and a bachelor’s degree in computer engineering from the University of Mumbai, India.

Presentations

Apache Spark in the hands of data scientists Session

Neelesh Srinivas Salian offers an overview of the data platform used by data scientists at Stitch Fix, based on the Spark ecosystem. Neelesh explains the development process and shares some lessons learned along the way.

Kaz Sato is a staff developer advocate on Google’s Cloud Platform team, where he focuses on machine learning and data analytics products, such as TensorFlow, Cloud ML, and BigQuery. Kaz has also led and supported developer communities for Google Cloud for over eight years. He has been an invited speaker at events including Google Cloud Next ’17 SF, Google I/O 2016 and 2017, the 2017 Strata Data Conference in London, the 2016 Strata + Hadoop World in San Jose and NYC, the 2016 Hadoop Summit, and ODSC East 2016 and 2017. Kaz is also interested in hardware and the IoT and has been hosting FPGA meetups since 2013.

Presentations

BigQuery and TensorFlow: How a data warehouse + machine learning enables "smart" queries Session

BigQuery is Google's fully managed, petabyte-scale data warehouse. Its user-defined functions enable "smart" queries with the power of machine learning, such as similarity searches or recommendations on images or documents using feature vectors and neural network prediction. Kazunori Sato demonstrates how BigQuery and TensorFlow together enable a powerful "data warehouse + ML" solution.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.

Machine learning with TensorFlow (Day 2) Training Day 2

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.

Tim Seears is area practice director for Asia-Pacific at Think Big, a Teradata company. Previously, he was CTO of Big Data Partnership (acquired by Teradata in 2016), which he cofounded after a career in the space industry working on NASA’s Cassini orbiter mission at Saturn. Tim and his team established Big Data Partnership as a dominant thought leader throughout the European market, providing data science, data engineering, and big data architecture services to global enterprise customers.

Presentations

Deep learning for recommender systems Tutorial

Tim Seears and Karthik Bharadwaj Thirumalai explain how to apply deep learning to improve consumer recommendations by training neural nets to learn categories of interest using embeddings. They then demonstrate how to extend this with WALS matrix factorization to achieve wide and deep learning—a process now used in production for the Google Play Store.

Jonathan Seidman is a software engineer on the partner engineering team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Tushar Shanbhag is head of data strategy and data products at LinkedIn. Tushar is a seasoned executive with a track record of building high-growth businesses at market-defining companies such as LinkedIn, Cloudera, VMware, and Microsoft. Most recently, Tushar was vice president of products and design at Arimo, an Andreessen Horowitz company building data intelligence products using analytics and AI.

Presentations

Privacy by design, not an afterthought: Best practices at LinkedIn Session

LinkedIn houses the most valuable professional data in the world. Protecting the privacy of member data has always been paramount. Shirshanka Das and Tushar Shanbhag outline three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement framework, and a unified data access layer.

Vira Shanty is the chief data officer at Lippo Digital, where she is responsible for the strategy, design, development, and implementation of big data technology and use cases for the Lippo Group. She is also responsible for building the first big data technology and solutions team within the Lippo Group, which is tasked with uncovering new revenue opportunities and driving product innovation.

Presentations

Delivering a big data analytics API with 360-degree customer profile data from multiple industry data sources (sponsored by Kinetica) Session

Vira Shanty explains how the Lippo Group, one of the largest business conglomerates in Indonesia, is integrating data from multiple lines of business into a single big data analytic platform featuring an API layer with subsecond latency and how the company's mantra “deep and fast analytics” is opening new opportunities for improved customer engagement and new revenue streams.

Raghav Sharma is a solution delivery manager at ZS Associates, where he specializes in big data platforms, cloud-based analytical solutions, and information architecture and helps lead the delivery of technology consulting engagements in the big data space for life sciences industry clients.

Presentations

High-performance enterprise data processing with Spark Session

Vickye Jain and Raghav Sharma explain how they built a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.

Ofir Sharony is a senior member of MyHeritage’s backend team, where he is currently focused on building pipelines on-premises and in the cloud using batch and streaming technologies. An expert in building data pipelines, Ofir acquired most of his experience planning and developing scalable server-side solutions.

Presentations

From Kafka to BigQuery: A guide for delivering billions of daily events Session

What are the most important considerations when shipping billions of daily events for analysis? Ofir Sharony shares MyHeritage's journey to find a reliable and efficient way to achieve real-time analytics. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline.

Hugo Sheng leads Qlik’s partner engineering organization, which is responsible for both developer relations and the integration of global technology partner solutions with the Qlik platform. Hugo has spent over 20 years in data management and analytics, specializing in highly scalable big data solutions. Over his career, he has held senior roles at Torrent Systems, Ascential Software, and IBM and led sales engineering and services at Expressor Software (acquired by Qlik in 2012). Hugo also spent several years in software development in the medical device industry. Hugo holds a BS in electrical engineering from the University of Houston and an MBA from the Jones Graduate School of Business at Rice University.

Presentations

Smart agriculture: Blending IoT sensor data with visual analytics on Apache Hive and Spark DCS

Mike Prorock and Hugo Sheng offer an overview of mesur.io, a game-changing climate awareness solution that utilizes Apache Hive, Spark, ESRI, and Qlik. Mesur.io combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market.

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Unraveling data with Spark using deep learning and other algorithms from machine learning Tutorial

Vartika Singh and Jeffrey Shmain walk you through various approaches using the machine learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Vartika and Jeff also demonstrate how to leverage open source deep learning frameworks to run classification problems on image and text datasets leveraging Spark.

Fahd Siddiqui is a software engineer at Cloudera, where he’s working on cloud products, such as Cloudera Altus and Cloudera Director. Previously, Fahd worked at Bazaarvoice developing EmoDB, an open source data store built on top of Cassandra. His interests include highly scalable and distributed systems. He holds a master’s degree in computer engineering from the University of Texas at Austin.

Presentations

A deep dive into running big data workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

Vartika Singh is a solutions architect at Cloudera with over 12 years of experience applying machine learning techniques to big data problems.

Presentations

Unraveling data with Spark using deep learning and other algorithms from machine learning Tutorial

Vartika Singh and Jeffrey Shmain walk you through various approaches using the machine learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Vartika and Jeff also demonstrate how to leverage open source deep learning frameworks to run classification problems on image and text datasets leveraging Spark.

Bargava Subramanian is a machine learning engineer based in Bangalore, India. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. He mentors early-stage startups in their data science journey. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

A recommendation system for wide transactions Session

Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving it using Elasticsearch in the cloud.

Engineering cloud-native machine learning applications Session

In today's Agile business environment, where developers must experiment with multiple ideas and react quickly to changing situations, cloud-native development is the way to go. Harjinder Mistry and Bargava Subramanian explain how to design and build a microservices-based cloud-native machine learning application.

Interactive visualization for data science Tutorial

One of the challenges with traditional data visualizations is that they are static and bounded by limited physical/pixel space. Interactive visualizations allow us to move beyond this limitation by adding layers of interaction. Bargava Subramanian and Amit Kapoor teach the art and science of creating interactive data visualizations.

Tzu-Li (Gordon) Tai is a software engineer at data Artisans and an Apache Flink committer and PMC member. His main contributions on Flink include work on Flink’s streaming connectors (Kafka, AWS Kinesis, Elasticsearch) and its type serialization stack and state management capabilities. Gordon is a frequent speaker at conferences such as Flink Forward, Flink meetups in Berlin and Taiwan, and several Taiwan-based conferences on the Hadoop ecosystem and data engineering in general.

Presentations

The stream processor as a database: Building event-driven applications with Apache Flink Session

Apache Flink is evolving from a framework for streaming data analytics to a platform that offers a foundation for event-driven applications that replaces the data management aspects that are typically handled by a database in more conventional architectures. Tzu-Li (Gordon) Tai explores the key features that are powering Flink's evolution, along with demonstrations of them in action.

Grace Tang leads experimental research advising for the APAC region at Uber. Previously, Grace was lead data scientist at online real estate platform 99.co. She holds a PhD in neuroscience from Stanford University, where she worked in the Decision Neuroscience Lab, studying how personality traits, emotions, and external stimuli affect decision making.

Presentations

Turning fails into wins Session

Being a data-driven company means that we have to move fast and fail often. But how do we learn to not only be proud of our failures but also turn these fails into wins? Grace Tang explains how to set up experiments so that negative results become epic wins, saving your team time, effort, and money, instead of just being swept under the carpet.

Eric Tham is an associate lecturer at the National University of Singapore. Previously, he was an enterprise data scientist at Thomson Reuters, led the quantitative data science team in a Chinese fintech startup with five million users, and worked in the financial industry in risk management, quantitative development, and energy economics with banks and oil companies. Over his career, he has developed sentiment indices from social media data and is an expert in unstructured data analysis, NLP, and machine learning in financial applications. He is a frequent speaker at conferences and contributed a chapter to the Handbook of Sentiment Analysis in Finance.

Presentations

Practical applications for graph techniques in supply chain analysis and finance Session

Graph techniques are increasingly being applied to big data. These techniques can be broadly classified into the three C's: centrality, clustering, and connectedness. Eric Tham explains how these concepts are applied to supply chain analysis and financial portfolio management.

Jeffrey Theobald is a senior data engineer at Zendesk. Jeffrey has worked in data engineering for eight years, mostly using Python, bash, Ruby, C++, and Java. He has used Hadoop since 2011 and has built analytics and batch processing systems as well as data preparation tools for machine learning.

Presentations

The trials of machine learning at Zendesk Session

Building a successful machine learning model is extremely challenging, and just as much effort is needed to turn that model into a customer-facing product. Drawing on their experience working on Zendesk's article recommendation product, Wai Yau and Jeffrey Theobald discuss design challenges and real-world problems you may encounter when building a machine learning product at scale.

Karthik Bharadwaj is a senior data scientist in the Data Science Center of Expertise at Teradata, where he provides analytic thought leadership and generates demand for Teradata products. Karthik has seven years of experience in the data management and analytics industry. Previously, he was a researcher at IBM Research, where he developed smarter transportation systems that predict traffic on the Singapore road network. Karthik holds a master’s degree from the National University of Singapore.

Presentations

Deep learning for recommender systems Tutorial

Tim Seears and Karthik Bharadwaj Thirumalai explain how to apply deep learning to improve consumer recommendations by training neural nets to learn categories of interest using embeddings. They then demonstrate how to extend this with WALS matrix factorization to achieve wide and deep learning—a process now used in production for the Google Play Store.

Wee Hyong Tok is a principal data science manager for the cloud AI team at Microsoft, where he works with teams to cocreate new value and turn the challenges facing organizations into compelling data stories that can be concretely realized using proven enterprise architecture. Wee Hyong has worn many hats in his career, including developer, program/product manager, data scientist, researcher, and strategist, and this range of experience has given him unique superpowers to nurture and grow high-performing innovation teams that enable organizations to embark on data-driven digital transformations using artificial intelligence. He has a passion for leading AI-driven innovations, working with teams to envision how these innovations can create new competitive advantage and value for their business, and he strongly believes in story-driven innovation. He coauthored one of the first books on Azure Machine Learning, Predictive Analytics Using Azure Machine Learning, and authored Doing Data Science with SQL Server, which demonstrates how database professionals can do AI with databases.

Presentations

Bootstrap custom image classification using transfer learning Session

Transfer learning enables you to use pretrained deep neural networks (e.g., AlexNet, ResNet, and Inception V3) and adapt them for custom image classification tasks. Danielle Dean and Wee Hyong Tok walk you through the basics of transfer learning and demonstrate how you can use the technique to bootstrap the building of custom image classifiers.

Training and scoring deep neural networks in the cloud Session

Deep neural networks are responsible for many advances in natural language processing, computer vision, speech recognition, and forecasting. Danielle Dean and Wee Hyong Tok illustrate how cloud computing has been leveraged for exploration, programmatic training, real-time scoring, and batch scoring of deep learning models for projects in healthcare, manufacturing, and utilities.

Teresa Tung is a managing director at Accenture Technology Labs, where she is responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s industry platforms like Accenture Digital Connected Products and the Accenture Analytics Insights Platform. Teresa holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

DevOps for models: How to manage millions of models in production—and at the edge Session

As Accenture scaled to millions of predictive models, it required automation to ensure accuracy, prevent false alarms, and preserve trust. Teresa Tung, Ishmeet Grewal, and Jurgen Weichenberger explain how Accenture implemented a DevOps process for analytical models that's akin to software development—guaranteeing analytics modeling at scale and even in noncloud environments at the edge.

Executive Briefing: Becoming a data-driven enterprise—A maturity model Session

A data-driven enterprise maximizes the value of its data. But how do enterprises emerging from technology and organization silos get there? Teresa Tung explains how to create a data-driven enterprise maturity model that spans technology and business requirements and walks you through use cases that bring the model to life.

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, where she is responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

A deep dive into running big data workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

Arun Veettil is the founder, CEO, and chief architect at Skellam AI, a company dedicated to helping businesses develop custom AI solutions. Arun is a tech veteran with over 17 years of industry experience. For the last seven years, Arun has been working at the intersection of machine learning and product development. Previously, he worked at Point Inside, Nordstrom Advanced Analytics, the Walt Disney Company, and IBM. His expertise includes developing machine learning algorithms to run against very large amounts of data and building large-scale distributed applications. Arun holds a master’s degree in computer science from the University of Washington and a bachelor’s degree in electronics engineering from the National Institute of Technology, Allahabad, India.

Presentations

Architecting a text analytics system in the cloud Session

Arun Veettil shares his experience and lessons learned developing a customized, enterprise-level NLP platform to replace a leading text analytics vendor platform.

Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He is an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.

Presentations

An adaptive execution mode for Spark SQL Session

Spark SQL is one of the most popular components of Apache Spark. Carson Wang and Yucai Yu explore Intel's efforts to improve SQL performance and offer an overview of an adaptive execution mode they implemented for Spark SQL.

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

A deep dive into running big data workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

Quanfu Wang is a senior architect on Intel’s big data team, where he is working on software optimization and acceleration on information architecture and heterogeneous computing. Previously, Quanfu was a lead software engineer at Alcatel-Lucent, where he worked for the company’s wireline business group.

Presentations

FPGA-based acceleration architecture for Spark SQL Session

Xie Qi and Quanfu Wang offer an overview of a configurable FPGA-based Spark SQL acceleration architecture that leverages FPGAs' massively parallel computing capability to dramatically accelerate Spark SQL queries and their power efficiency to reduce power consumption.

Zhenhua Wang is a software engineer on JD.com’s AI and big data team, where he works on algorithm research and development for machine learning and computer vision, focusing on image feature representation, large-scale image deduplication, and search.

Presentations

Bringing deep learning into big data analytics using BigDL Session

Xianyan Jia and Zhenhua Wang explore deep learning applications built successfully with BigDL. They also teach you how to develop fast prototypes with BigDL's off-the-shelf deep learning toolkit and build end-to-end deep learning applications with flexibility and scalability using BigDL on Spark.

Michal Wegrzyn is a software engineer at Teralytics, where he is responsible for developing batch and streaming big data analytics applications. Previously, he worked at CellVision and Orange Poland, where he built and delivered telco analytics applications. He holds a master’s degree in electronics and telecommunications from AGH University of Science and Technology.

Presentations

Distributed real-time highly available stream processing Session

Yu-Xi Lim and Michal Wegrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. At its core, the pattern relies on functional reactive programming idioms to shard and splice state fragments, ensuring horizontal scalability, reliability, and high availability.

Jürgen Weichenberger is a data science senior principal at Accenture Analytics, where he works within resources industries with interests in smart grids and power, digital plant engineering, and optimization for upstream industries and the water industry. Jürgen has over 15 years of experience in engineering consulting, data science, big data, and digital change. In his spare time, he enjoys spending time with his family and playing golf and tennis. Jürgen holds a master’s degree (with first-class honors) in applied computer science and bioinformatics from the University of Salzburg.

Presentations

DevOps for models: How to manage millions of models in production—and at the edge Session

As Accenture scaled to millions of predictive models, it required automation to ensure accuracy, prevent false alarms, and preserve trust. Teresa Tung, Ishmeet Grewal, and Jurgen Weichenberger explain how Accenture implemented a DevOps process for analytical models that's akin to software development—guaranteeing analytics modeling at scale and even in noncloud environments at the edge.

Graham Williams is director of data science at Microsoft, where he is responsible for the Asia-Pacific region, an adjunct professor with the University of Canberra and the Australian National University, and an international visiting professor with the Chinese Academy of Sciences. Graham has 30 years’ experience as a data scientist leading research and deployments in artificial intelligence, machine learning, data mining, and analytics. Previously, he was principal data scientist with the Australian Taxation Office and lead data scientist with the Australian Government’s Centre of Excellence in Analytics, where he assisted numerous government departments and Australian industry in creating and building data science capabilities. He has also worked on many projects focused on delivering solutions and applications driven by data using machine learning and artificial intelligence technologies. Graham has authored a number of books introducing data mining and machine learning using the R statistical software.

Presentations

R you ready for the cloud? Using R for operationalizing an enterprise-grade data science solution on Azure Session

R has long been criticized for its limitations in scalable data analytics. What's needed is an R-centric paradigm that enables data scientists to elastically harness cloud resources of varied computing capability for large-scale data analytics. Le Zhang and Graham Williams demonstrate how to operationalize an end-to-end enterprise-grade pipeline for big data analytics—all within R.

Benjamin Wright-Jones is a solution architect in the Microsoft WW Services CTO Office for Data and AI, where his team helps enterprise customers solve their analytical challenges. Over his career, Ben has worked on some of the largest and most complex data-centric projects around the globe.

Presentations

The value of a data science center of excellence (COE) Session

As organizations turn to data-driven strategies, they are also increasingly exploring the creation of a data science or analytic center of excellence (COE). Benjamin Wright-Jones and Simon Lidberg outline the building blocks of a center of excellence and describe the value for organizations embarking on data-driven strategies.

Mingxi Wu is the vice president of engineering at TigerGraph, a Silicon Valley startup building a world-leading real-time graph data platform. During his 15-year career, Mingxi has focused on database research and building data management software; his recent interest is in building an easy-to-use and highly expressive graph query language. Previously, he worked in Microsoft’s SQL Server Group and Oracle’s Relational Database Optimizer Group. He has won research awards from the most prestigious venues in databases and data mining, including SIGMOD, KDD, and VLDB. Mingxi holds a PhD from the University of Florida, where he specialized in databases and data mining.

Presentations

TigerGraph: A complete high-performance graph data and analytics platform Session

Mingxi Wu and Yu Xu offer an overview of TigerGraph, a high-performance enterprise graph data platform that enables businesses to transform structured, semistructured, and unstructured data and massive enterprise data silos into an intelligent interconnected data network, allowing them to uncover the implicit patterns and critical insights to drive business growth.

Xiaochang Wu is a senior software engineer on Intel’s big data engineering team, where he helps deliver the best Spark performance on Intel platforms. Xiaochang has more than 10 years’ experience in performance optimization for Intel architecture. He holds a master’s degree in computer science from Xiamen University of China.

Presentations

Spark Structured Streaming helps smart manufacturing Session

Xiaochang Wu explains how to design and implement a real-time processing platform using the Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry.

YongLiang Xu is the lead data architect for SmartHub, the analytics division of StarHub, where he is responsible for transforming and architecting the next generation of big data architecture. His work includes reengineering SmartHub’s big data platform to support real-time processing and real-time machine learning, experimenting with new Apache projects, and optimizing the big data platform for streamlined and seamless performance. Previously, YongLiang was a software engineer at DSO National Laboratories, Singapore, where he developed solutions based on big data technologies.

Presentations

Fusing a deep learning platform with a big data platform Session

SmartHub and NTT DATA have embarked on a partnership to design next-generation architecture to power the data products that will help generate new insights. YongLiang Xu and Masatake Iwasaki explain how deep learning and other analytics models can coexist on the same platform to address opportunities and challenges in initiatives such as smart cities.

Yu Xu is the founder and CEO of TigerGraph, the world’s first native parallel graph database. He is an expert in big data and parallel database systems and has over 26 patents in parallel data management and optimization. Previously, Yu worked on Twitter’s data infrastructure for massive data analytics and was Teradata’s Hadoop architect leading the company’s big data initiatives. Yu holds a PhD in computer science and engineering from the University of California, San Diego.

Presentations

TigerGraph: A complete high-performance graph data and analytics platform Session

Mingxi Wu and Yu Xu offer an overview of TigerGraph, a high-performance enterprise graph data platform that enables businesses to transform structured, semistructured, and unstructured data, along with massive enterprise data silos, into an intelligent interconnected data network, allowing them to uncover implicit patterns and critical insights that drive business growth.

Wai Chee Yau is a senior data engineer at Zendesk. A polyglot developer who loves working with data and machine learning, she has more than nine years’ experience in data processing, distributed systems, APIs, and system integration across a number of industries. She completed a PhD in computer vision in 2008.

Presentations

The trials of machine learning at Zendesk Session

Building a successful machine learning model is extremely challenging in itself, and just as much effort is needed to turn that model into a customer-facing product. Drawing on their experience working on Zendesk's article recommendation product, Wai Yau and Jeffrey Theobald discuss design challenges and real-world problems you may encounter when building a machine learning product at scale.

Yucai Yu is a software architect at Intel, where he works on Apache Spark upstream development and IA optimization. Previously, he worked at IBM and Citibank with a focus on OS, virtualization, storage, and data warehouses.

Presentations

An adaptive execution mode for Spark SQL Session

Spark SQL is one of the most popular components of Apache Spark. Carson Wang and Yucai Yu explore Intel's efforts to improve SQL performance and offer an overview of an adaptive execution mode they implemented for Spark SQL.

Wataru Yukawa is a data engineer at LINE, where he creates and maintains a log analysis platform based on Hadoop, Hive, Fluentd, Presto, and Azkaban. He also works on aggregating log and RDBMS data with Hive and on reporting using BI tools.

Presentations

LINE's log analysis platform Session

Data is a very important asset to LINE, one of the most popular messaging applications in Asia. Wataru Yukawa explains how LINE gets the most out of its data using a Hadoop data lake and an in-house log analysis platform.

Le Zhang is a data scientist at Microsoft, where he builds scalable data analytics tools for cloud platforms and data-driven accelerators for solving enterprise-level business problems. Previously, Le worked at a semiconductor company, where he developed an intelligent wafer defect recognition system using machine learning technology. Le holds a PhD in computer engineering.

Presentations

R you ready for the cloud? Using R for operationalizing an enterprise-grade data science solution on Azure Session

R has long been criticized for its limitations on scalable data analytics. What's needed is an R-centric paradigm that enables data scientists to elastically harness cloud resources with varied computing capabilities for large-scale data analytics. Le Zhang and Graham Williams demonstrate how to operationalize an end-to-end enterprise-grade pipeline for big data analytics—all within R.