Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference
Singapore

Speakers

Leading practitioners will share tips, best practices, and real-world expertise in a rich variety of sessions. Please check back to see the latest updates to the agenda.

Shilpa Aggarwal is an associate partner at McKinsey & Company, where she coleads McKinsey’s analytics in Southeast Asia. With deep expertise in growth opportunities and advanced analytics, Shilpa focuses on B2C industries like technology and banking and advises leading B2C clients in Asia on a broad range of topics from operations and technology to building new digital businesses and developing and driving additional revenues from scientific marketing. Some of Shilpa’s recent work has included developing a long-term strategy for building smart cities for a leading Southeast Asian economy and advising on building a big data business for external monetization; she has also been involved in deploying cutting-edge proofs of concept for banks in analytics and artificial intelligence. Shilpa holds an MBA from Mumbai University and a bachelor’s degree in electronics engineering from Pune University.

Presentations

Executive Briefing: McKinsey & Company Session

Executive Briefing with Shilpa Aggarwal

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data, AI, and data science have great potential for accelerating business, but how do you reconcile business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. John Akred and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.

Organizing for machine learning success Session

Deploying machine learning in business requires far more than just selecting an algorithm. You need the right architecture, tools, and team organization to drive your agenda successfully. John Akred and Mark Hunter share practical advice on the technical and human sides of machine learning, based on experience preparing Sainsbury’s for its ML-enabled future.

Sarang Anajwala is technical product manager for Autodesk’s next-generation data platform, where he focuses on building the self-service platform for data analytics and data products. Sarang has extensive experience in data architecture and data strategy. Previously, he worked as an architect building robust big data systems, including a next-generation platform for contextual communications and a data platform for IoT. He has filed three patents for his innovations in the contextual communications and adaptive designs spaces.

Presentations

Enabling data-driven decision making: Challenges of logical and physical scale Session

Autodesk's centralized data platform enables data-driven decision making by democratizing analytics across the various teams based on their personas and proficiencies. Sarang Anajwala explores the various user personas of the big data platform, challenges in enabling them for efficient interactions with big data, and his experience navigating these challenges.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Executive Briefing: The Five Dysfunctions of a Data Engineering Team Session

If you’re creating a data engineering team, there are common mistakes and patterns. These lead a data engineering team to either fail or perform at a much lower level. Early project success is predicated on management making sure the team is ready and has all of the skills needed.

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Real-time systems with Spark Streaming and Kafka (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Aki Ariga is a field data scientist at Cloudera, where he works on service development with machine learning and natural language processing. His work has included researching spoken dialogue systems, building a large corpus analysis system, and developing services such as recipe recommendations. Aki is a sparklyr contributor. He organizes several tech communities in Japan, including Ruby, machine learning, and Julia.

Presentations

Train, predict, and serve: How to put your machine learning model into production Session

Aki Ariga explains how to put your machine learning model into production, discusses common issues and obstacles you may encounter, and shares best practices and typical architecture patterns for deploying ML models, with example designs from the Hadoop and Spark ecosystem using Cloudera Data Science Workbench.

Carme Artigas is the founder and CEO of Synergic Partners, a strategic and technological consulting firm specializing in big data and data science (acquired by Telefónica in 2015). She has more than 20 years of extensive expertise in the telecommunications and IT fields and has held several executive roles in both private companies and governmental institutions. Carme is a member of the Innovation Board of CEOE and the Industry Affiliate Partners at Columbia University’s Data Science Institute. An in-demand speaker on big data, she has given talks at several international forums, including Strata Data Conference, and collaborates as a professor in various master’s programs on new technologies, big data, and innovation. Carme was recently recognized as the only Spanish woman among the 30 most influential women in business by Insight Success. She holds an MS in chemical engineering, an MBA from Ramon Llull University in Barcelona, and an executive degree in venture capital from UC Berkeley’s Haas School of Business.

Presentations

Executive Briefing: Analytics centers of excellence as a way to accelerate big data adoption by business Session

Big data technology is mature, but its adoption by business is slow, due in part to challenges like a lack of resources or the need for cultural change. Carme Artigas explains why an analytics center of excellence (ACoE), whether internal or outsourced, is an effective way to accelerate adoption and shares an approach to implementing an ACoE.

Keynote with Carme Artigas Keynote

Keynote with Carme Artigas

Ricky Barron is founder and principal at InfoStrategy, a data management and analytics strategy consultancy helping medium to large enterprises develop and operationalize insights for their businesses.

Presentations

Executive Briefing: How to structure, recruit, operationalize, and maintain your insights organization Session

To many organizations, big data analytics is still a solution looking for a problem. Ricky Barron shares practical methods for getting the best out of your big data analytics capability and explains why establishing an "insights group" can improve the bottom line, drive performance, optimize processes, and create new data-driven products and solutions.

Zhaojuan Bianny Bian is an engineering manager in Intel’s Software and Service Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of experience in the industry with performance analysis experience that spans big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.

Presentations

Best practices with Kudu: An end-to-end use case from the automobile industry Session

Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.

Joshua Bloom is vice president of data and analytics at GE Digital, where he serves as the technology and research lead bringing machine learning applications to market within the GE ecosystem. Previously, Joshua was cofounder and CTO of Wise.io (acquired by GE Digital in 2016). Since 2005, he has also been an astronomy professor at the University of California, Berkeley, where he teaches astrophysics and Python for data science. Josh has been awarded the Moore Foundation Data-Driven Investigator Prize and the Pierce Prize from the American Astronomical Society; he is also a former Sloan fellow, a junior fellow of the Harvard Society, and a Hertz Foundation fellow. Joshua holds a PhD from Caltech and degrees from Harvard and Cambridge University.

Presentations

Industrial machine learning Keynote

The ongoing digitization of the industrial-scale machines that power and enable human activity is itself a major global transformation. Joshua Bloom explains why the real revolution—in efficiencies and in improved and saved lives—will happen when machine learning automation and insights are properly coupled to the complex systems of industrial data.

Alexandre Chade is founder and executive chairman at Dotz, the largest coalition loyalty program in Latin America, with more than 25 million members and 200+ affiliated companies. Alex is a serial entrepreneur. Previously, he founded, developed, and sold or IPOed a number of companies in telecommunications, media, entertainment, tourism, real estate, and loyalty. He has also founded and presides over a range of NGOs in Brazil and abroad. Alex holds law and business degrees, undertaken in part through graduate and extension courses at MIT, Harvard, and other schools.

Presentations

From physical data collection to digital delivery of results: The data journey in developing economies DCS

Alex Chade shares how Dotz used a coalition loyalty program to successfully collect transactional data (down to the SKU level) from its tens of millions of members across a number of segments, including grocery, gas, pharma, apparel, electronics, CPGs, insurance, and credit cards, and did so mostly in the physical world.

Cupid Chan is a managing partner at 4C Decision, where he helps clients ranging from Fortune 500 companies to the public sector leverage the power of data, analytics, and technology to gain invaluable insight to improve various aspects of their businesses. Previously, he was one of the key players in the construction of a world-class BI platform. A bilingual seasoned professional, Cupid holds various technical and business accreditations, such as PMP and Lean Six Sigma.

Presentations

Big data on the rise: Views of emerging trends and predictions from real-life end users Session

John Mertic and Cupid Chan share real end-user perspectives from companies like GE on how they are using big data tools, challenges they face, and where they are looking to focus investments—all from a vendor-neutral viewpoint.

Wei Chen is a software engineer at Intel. He is dedicated to performance optimization and simulation of storage engines for big data. Wei holds a master’s degree in signal and information processing from Nanjing University in China.

Presentations

Best practices with Kudu: An end-to-end use case from the automobile industry Session

Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.

Cheng Feng is a data engineer at Grab, where he works on the big data platform, distributed computing, stream processing, and data science. Previously, he was a data scientist at the Lazada Group, working on Lazada’s tracker, customer segmentation and recommendation systems, and fraud detection.

Presentations

Operationalizing Presto in the cloud: Lessons and mistakes Session

Grab uses Presto to support operational reporting (batch and near real-time), ad hoc analyses, and its data pipeline. Currently, Grab has 5+ clusters with 100+ instances in production on AWS and serves up to 30K queries per day while supporting more than 200 internal data users. Feng Cheng and Yanyu Qu explain how Grab operationalizes Presto in the cloud and share lessons learned along the way.

Victor Chua is a senior data scientist and visualizer on the innovations team at SmartHub (part of StarHub Ltd.), where he is responsible for building and delivering unique telco analytics products through big data technologies. Victor has a strong passion for data analytics and 3D graphics. He holds a master’s degree in information systems management from Carnegie Mellon University.

Presentations

Analyzing smart cities and big data in 3D: A Geo3D journey at SmartHub Smart Cities

The rise of densely populated, highly built-up smart cities around the globe has stretched the capabilities of current 2D visualization techniques. With the advent of drones, IoT devices, and indoor geolocation, next-gen 3D visualizations are beginning to address this challenge. Victor Chua explores how SmartHub is gearing up for a 3D future to support cutting-edge data analytics.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Data Case Studies welcome Tutorial

Program chair Alistair Croll welcomes you to the Data Case Studies tutorial.

Smart Cities welcome Tutorial

Program chair Alistair Croll welcomes you to the first day of the Smart Cities tutorial.

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Amit Das is the cofounder and CEO of Think Analytics India, where he has conceptualized many fintech enabler analytics solutions, including Algo360, an alternate data solution. Amit fell in love with analytics while working at Tata Consultancy Services. Over his career, he has worked for a number of successful companies, including Inductis (acquired by ExL Services) and Diamond Consultants (acquired by PwC LLP USA). Amit also led analytics delivery for PwC USA and was an EVP at 3i Infotech Limited, where he set up analytics as a capability and built smarter software products for banking and financial services. Amit’s love for data led him to build the foundations for an emerging market consumer dataset through Vito, a cutting-edge alternate data solution for the Indian market that brings down the cost of underwriting by over 40%. Amit holds a master’s degree in management from the Indian Institute of Management, Bangalore, and an undergraduate degree in economics from the University of Delhi.

Presentations

Driving financial inclusion in emerging markets using alternate data and big data analytics Session

Access to credit in emerging markets is impeded by issues around identity verification, risk assessment and monitoring, and the costs of underwriting and collections. At the core of it all is a lack of data. Amit Das explains how access to alternate data, real-time risk monitoring and data access solutions, and smart analytics are changing the lending landscape in India.

Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.

Presentations

Privacy by design, not an afterthought: Best practices at LinkedIn Session

LinkedIn houses the most valuable professional data in the world. Protecting the privacy of member data has always been paramount. Shirshanka Das and Tushar Shanbhag outline three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement framework, and a unified data access layer.

Danielle Dean is a senior data scientist lead at Microsoft in the Algorithms and Data Science Group within Cloud and Enterprise, where she leads a team of data scientists and engineers on end-to-end analytics projects using Microsoft’s Cortana Intelligence Suite—from automating the ingestion of data to analysis and implementation of algorithms, creating web services of these implementations, and using those to integrate into customer solutions or build end-user dashboards and visualizations. Danielle holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill, where she studied the application of multilevel event history models to understand the timing and processes leading to events between dyads within social networks.

Presentations

Bootstrap custom image classification using transfer learning Session

Transfer learning enables you to use pretrained deep neural networks (e.g., AlexNet, ResNet, and Inception V3) and adapt them for custom image classification tasks. Danielle Dean and Wee Hyong Tok walk you through the basics of transfer learning and demonstrate how you can use the technique to bootstrap the building of custom image classifiers.

Training and scoring deep neural networks in the cloud Session

Deep neural networks are responsible for many advances in natural language processing, computer vision, speech recognition, and forecasting. Danielle Dean and Wee Hyong Tok illustrate how cloud computing has been leveraged for exploration, programmatic training, real-time scoring, and batch scoring of deep learning models for projects in healthcare, manufacturing, and utilities.

Masaru Dobashi is a system infrastructure engineer at NTT DATA Corporation, where he leads the OSS professional service team and is responsible for introducing Hadoop, Spark, Storm, and other OSS middleware into enterprise systems. Previously, Masaru developed an enterprise Hadoop cluster consisting of over 1,000 nodes—one of the largest Hadoop clusters in Japan—and designed and provisioned several kinds of clusters using non-Hadoop open source software, such as Spark and Storm.

Presentations

Fusing a deep learning platform with a big data platform Session

SmartHub and NTT DATA have embarked on a partnership to design next-generation architecture to power the data products that will help generate new insights. YongLiang Xu and Masaru Dobashi explain how deep learning and other analytics models coexist within the same platform to address issues relating to smart cities.

Wolff Dobson is a developer programs engineer at Google specializing in machine learning and games. Previously, he worked as a game developer, where his projects included writing AI for the NBA 2K series and helping design the Wii Motion Plus. Wolff holds a PhD in artificial intelligence from Northwestern University.

Presentations

TensorFlow: Open source machine learning Session

TensorFlow, the world's most popular machine learning framework, is fast, flexible, and production ready. Wolff Dobson, representing the Google Brain team, shares the latest developments in TensorFlow, including tensor processing units (TPUs), distributed training, new APIs and models, and mobile features. Join in to learn what's in store for TensorFlow and how ML can change your business.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

GDPR: Getting Your Data Ready for Heavy New EU Privacy Regulations Session

The General Data Protection Regulation (GDPR) will go into effect in 2018 for firms doing any business in the EU. However, many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify compliance with GDPR, as well as future regulations.

Smart Cities, The Smart Grid, IoT, and Big Data Smart Cities

Smart cities and the electricity smart grid have rapidly become leading examples of IoT: distributed sensors describe mission-critical behaviour by generating billions of metrics daily. Learn how smart utilities and cities rely on Hadoop to capture, analyze, and harness this data to increase safety, availability, and efficiency across the entire electricity grid.

Graham Dumpleton is a developer advocate for OpenShift at Red Hat. Graham is the author of mod_wsgi, a popular module for hosting Python web applications with the Apache HTTPD web server. He has a keen interest in Docker and platform-as-a-service (PaaS) technologies. Graham is a fellow of the Python Software Foundation and an emeritus member of the Apache Software Foundation.

Presentations

Deploying a scalable JupyterHub environment for running Jupyter notebooks Session

Jupyter notebooks provide a rich interactive environment for working with data. Running a single notebook is easy, but what if you need to provide a platform for many users at the same time? Graham Dumpleton demonstrates how to use JupyterHub to run a highly scalable environment for hosting Jupyter notebooks in education and business.

Joey Echeverria is the director of engineering at Rocana, where he builds applications for scaling IT operations built on the Apache Hadoop platform. Joey is a committer on the Kite SDK, an Apache-licensed data API for the Hadoop ecosystem. Joey was previously a software engineer at Cloudera, where he contributed to several ASF projects, including Apache Flume, Apache Sqoop, Apache Hadoop, and Apache HBase. Joey is also a coauthor of Hadoop Security, published by O’Reilly Media.

Presentations

Debugging Apache Spark Session

Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark, and more.

Bruno Fernandez-Ruiz is cofounder and CTO at Nexar, where he and his team are using large-scale machine learning and machine vision to capture and analyze millions of sensor and camera readings in order to make our roads safer. Previously, Bruno was a senior fellow at Yahoo, where he oversaw the development and delivery of Yahoo’s personalization, ad targeting, and native advertising teams; his prior roles at Yahoo included chief architect for Yahoo’s cloud and platform and chief architect for international. Prior to joining Yahoo, Bruno founded OneSoup (acquired by Synchronica and now part of the Myriad Group) and YamiGo; was an enterprise architect for Fidelity Investments; and served as a manager in Accenture’s Center for Strategic Research Group, where he cofounded Meridea Financial Services and Accenture’s claim software solutions group. Bruno holds an MSc in operations research and transportation science from MIT, with a focus on intelligent transportation systems.

Presentations

Pascale Fung is a professor in the Department of Electronic & Computer Engineering at the Hong Kong University of Science & Technology. She is the founding director of InterACT at HKUST, a joint research and education center with Carnegie Mellon University, University of Karlsruhe (TH), and Waseda University. She also cofounded the Human Language Technology Center (HLTC). Pascale is an affiliated faculty member with the Robotics Institute at HKUST and the founding chair of the Women Faculty Association at HKUST. Previously, she worked and studied at AT&T Bell Labs; BBN Systems & Technologies; LIMSI; CNRS, France; the Department of Information Science, Kyoto University, Japan; and at Ecole Centrale Paris, France. Pascale is an elected fellow of the Institute of Electrical and Electronics Engineers (IEEE) for her contributions to human-machine interactions and an elected fellow of the International Speech Communication Association for her fundamental contributions to the interdisciplinary area of spoken language human-machine interactions.

Pascale’s research interests lie in building intelligent systems that can understand and empathize with humans. To achieve this goal, her specific areas of research are statistical natural language processing, spoken language systems, emotion and sentiment recognition, and predictive analytics. She is an editor for Computer Speech and Language and former associate editor of a number of IEEE/ACM/ACL journals. Pascale has chaired and area-chaired the top conferences in the speech and language fields, served as a committee member of the IEEE Signal Processing Society Speech and Language Technology Committee (SLTC) for six years, and is the vice president and a board member of the ACL Special Interest Group on Linguistics Data and Corpus-Based Approaches in NLP (SIGDAT). She has also cofounded a number of multinational companies focused on natural language interfaces and personalized recommendation systems for Internet users and corporate customers. She was an associate chair of the HKUST One Million Dollar Entrepreneurship Competition and pioneered and has been teaching Engineering Entrepreneurship since 2007. Pascale received her PhD in computer science from Columbia University.

Presentations

Keynote with Pascale Fung Keynote

Pascale Fung, Professor, The Hong Kong University of Science and Technology

Siddha Ganju is a data scientist at Deep Vision, where she works on building deep learning models and software for embedded devices. Siddha is interested in problems that connect natural languages and computer vision using deep learning. Her work ranges from visual question answering to generative adversarial networks to gathering insights from CERN’s petabyte-scale data and has been published at top-tier conferences like CVPR. She is a frequent speaker at conferences and advises the Data Lab at NASA. Siddha holds a master’s degree in computational data science from Carnegie Mellon University, where she worked on multimodal deep learning-based question answering. When she’s not working, you might catch her hiking.

Presentations

Using the IoT + deep learning to track half a million faces in real time Session

We've come a long way since the advent of the IoT to a network of almost 30 billion IoT devices that include sensors and cameras. The data they gather and transmit is becoming increasingly complex. Siddha Ganju explains how deep learning can revolutionize IoT applications to recognize half a million faces at international airports using existing airport cameras.

Graham Gear is director of system engineering at Cloudera and an Apache Hadoop committer. Having thirstily read the Google papers that inspired Hadoop and watched as the community coalesced, Graham could clearly see the huge potential of the Hadoop ecosystem and has been contributing to it and helping organizations take advantage of it for many years. Previously, Graham delivered large-scale distributed systems with a keen analytical focus; he began his career implementing sonar algorithms leveraging MPI on large Beowulf clusters at a defense research institution.

Presentations

Real-world patterns for continuously deployed advanced analytics Session

How can we drive more data pipelines, advanced analytics, and machine learning models into production? How can we do this both faster and more reliably? Graham Gear draws on real-world processes and systems to explain how it's possible to apply continuous delivery techniques to advanced analytics, realizing business value earlier and more safely.

Bas Geerdink is a programmer, scientist, and IT manager at ING, where he is responsible for the fast data systems that process and analyze streaming data. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Fast Data at ING - streaming analytics solutions to create a real-time, data-driven bank Session

ING is using Apache Flink to build streaming analytics (fast data) solutions. We created a platform with Flink, Kafka, and Cassandra that offers high throughput and low latency, making it well suited to complex and demanding use cases in an international bank, such as customer notifications and fraud detection. This presentation gives an overview of the platform: architecture, use cases, and more!

Adam Gibson is the CTO and cofounder of Skymind, a deep learning startup focused on enterprise solutions in banking and telco, and the coauthor of Deep Learning: A Practitioner’s Approach.

Presentations

Unsupervised fuzzy labeling using deep learning to improve anomaly detection Session

Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatic labeling (and the pitfalls of doing so), and discover how you can deploy these techniques in your organization.

Gaurav Godhwani is the technical lead for the Open Budgets India initiative, in association with CBGA. This initiative aims to promote greater transparency, accountability, and public participation in budget processes by making India’s budgets open, usable, and easy to comprehend. Gaurav is also one of the chapter leaders for DataKind Bangalore, where he is building a team of pro bono data scientists to help nonprofits tackle projects addressing critical humanitarian problems.

Presentations

Open Budgets India: Lessons from the front line Session

Most of India’s budget documents aren’t easily accessible. Those published online are mostly available as unstructured PDFs, making it difficult to search, analyze, and use this crucial data. Gaurav Godhwani discusses the process of creating Open Budgets India and making India’s budgets open, usable, and easy to comprehend.

Ajey Gore is group CTO at GO-JEK, where he helps the company deliver a transport, logistics, lifestyle, and payments platform of 18 products. Ajey has 18 years of experience building core technology strategy across diverse domains. His interests include machine learning, networking, and scaling products. Previously, Ajey founded CodeIgnition (acquired by GO-JEK) and served as ThoughtWorks’s head of technology. An active influencer in the technology community, Ajey organizes conferences, including RubyConf, GopherCon, and devopsdays, through his not-for-profit organization.

Presentations

Keynote with Ajey Gore Keynote

Keynote with Ajey Gore

Ishmeet Grewal is a senior research analyst at Accenture Technology Labs, where he is the lead developer responsible for developing and prototyping a comprehensive strategy for automated analytics at scale. Ishmeet has traveled to 25 countries and likes to climb rocks in his free time.

Presentations

DevOps for models: How to manage millions of models in production—and at the Edge Session

As Accenture scaled to millions of predictive models, it required automation to ensure accuracy, prevent false alarms, and preserve trust. Teresa Tung, Ishmeet Grewal, and Jurgen Weichenberger explain how Accenture implemented a DevOps process for analytical models that's akin to software development—guaranteeing analytics modeling at scale and even in non-cloud environments at the edge.

Arwen is an Oregonian (think Portlandia) expat who has lived in Melbourne for seven years. She works as a data scientist at Zendesk and is part of the team producing deep learning solutions for customer self-service.

Arwen has a PhD in machine learning with a minor in ecoinformatics. She is passionate about improving the status of underrepresented groups in STEM fields and applying machine learning to make the world a little bit better.

Presentations

AHA Moments in Deep Learning at Zendesk Session

Deep learning is presently -the- coolest kid on the machine learning block, but few companies are using this technology in a production environment. Zendesk uses deep learning to power Answer Bot, a question-answering system that resolves support tickets without agent intervention. In this session, we’ll share our descent into deep learning and the challenges and benefits we’ve seen along the way.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him.

Presentations

Getting started with TensorFlow Tutorial

We will walk you through training and deploying a machine-learning system using TensorFlow, a popular open source ML library. Starting from conceptual overviews, we will build all the way up to complex classifiers. You’ll gain insight into deep learning and how it can apply to complex problems in science and industry.

TensorFlow Wide & Deep: Data Classification the easy way Session

Learn how to use TensorFlow to easily combine linear regression models and deep neural networks with a machine learning model that has the benefits of both. You will also gain intuition about what is happening under the hood, and learn how you can use this model for your own datasets.

Andreas Hadimulyono is a data warehouse engineer at Grab, where he ensures uninterrupted, error-free uptime while meeting the SLA requirements of business intelligence, analytics, and data science workloads. Previously, Andreas worked for Human Longevity Singapore, where he was responsible for the data pipeline for phenotypic data, which is used for genotype-phenotype association studies.

Presentations

Streaming analytics at Grab Session

Andreas Hadimulyono discusses the challenges that Grab is facing with the ever-increasing volume and velocity of its data and shares the company's plans to overcome them.

Chris leads the Data Science team at Zendesk. Previously, he held the titles of data scientist, data engineer, researcher, PhD student, consultant, programmer, and, before that, student again. He describes his role as ‘turning lots of data into magic,’ and he does so with the help of machine learning, Python, Hadoop, graphs galore, and amazing colleagues.

Presentations

AHA Moments in Deep Learning at Zendesk Session

Deep learning is presently -the- coolest kid on the machine learning block, but few companies are using this technology in a production environment. Zendesk uses deep learning to power Answer Bot, a question-answering system that resolves support tickets without agent intervention. In this session, we’ll share our descent into deep learning and the challenges and benefits we’ve seen along the way.

Mick Hollison is chief marketing officer at Cloudera, where he leads the company’s worldwide marketing efforts, including advertising, brand, communications, demand, partner, solutions, and web. Mick has had a successful 25-year career in enterprise and cloud software. Previously, he was CMO of sales acceleration at machine learning company InsideSales.com, where, under his leadership, InsideSales pioneered a shift to data-driven marketing and sales that has served as a model for organizations around the globe; was global vice president of marketing and strategy at Citrix, where he led the company’s push into the high-growth desktop virtualization market; managed executive marketing at Microsoft; and held numerous leadership positions at IBM Software. Mick is an advisory board member for InsideSales and a contributing author on Inc.com. He is also an accomplished public speaker who has shared his insightful messages about the business impact of technology with audiences around the world. Mick holds a bachelor of science in management from the Georgia Institute of Technology.

Presentations

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it Session

Mick Hollison shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands), and outlines proven ways to meet them.

William Householder is a senior instructor for Cloudera University, where he delivers instructor-led training courses across the APAC region on Cloudera’s distribution of Apache Hadoop. William specializes in administration, development, and analysis using HDFS, YARN, MapReduce, Spark, Impala, Hive, Solr, HBase, Flume, and Kafka.

Presentations

Cloudera big data architecture workshop (Day 2) Training Day 2

This training brings together technical contributors in a group setting to design and architect solutions to a challenging business problem. You'll explore big data application architecture concepts in general and then apply them to the design of a challenging system.

Yiqun Hu is the head of data for SP Digital, where he is responsible for leading the data team on the development of machine learning capabilities for energy and utility applications. Previously, he helped several organizations build data-driven products such as image recognition systems and recommendation engines. Yiqun is the author of 30+ scientific publications in the machine learning area, with over 1,400 citations. He holds a PhD from Nanyang Technological University and a bachelor’s degree in computer science from Xiamen University.

Presentations

Energy disaggregation using self-taught deep networks Session

Energy disaggregation is very useful for energy-related applications such as energy monitoring, but only a small amount of labeled data is available because labeling is very expensive. Yiqun Hu shares a new solution using two deep networks: the first, an RNN-based network, extracts good features from unlabeled data; the second uses these features to disaggregate target appliances.

Mark Hunter is chief data officer at Sainsbury’s Bank. Previously, Mark was head of analytics and digital products at Coles Financial Services, where he worked across Beijing, Hong Kong, and Melbourne. He has served as deputy chair of ISAC, an analytics industry association in Australia.

Presentations

Organizing for machine learning success Session

Deploying machine learning in business requires far more than just selecting an algorithm. You need the right architecture, tools, and team organization to drive your agenda successfully. John Akred and Mark Hunter share practical advice on the technical and human sides of machine learning, based on experience preparing Sainsbury’s for its ML-enabled future.

I’m a manager at SK Telecom, South Korea’s largest wireless communications provider. I have six years of experience as a UX developer, and I’m now in charge of UX visualization of big data at SKT.

Presentations

Big Telco Real-Time Network Analytics Session

Data transfer is one of the most pressing problems for companies in the telecom industry today. As data requirements grow from month to month, the cost of handling that volume also rises sharply. Up to a certain point, large-scale data processing has been sufficient for SKT. In this presentation, Yousun Jeong details how SKT dealt with this problem.

Vickye jointly runs the big data expertise center within ZS and has extensive experience implementing large-scale big data platforms for Fortune 200 companies in the US. He and his team have implemented very large-scale ETL offloading use cases, data lakes, and high-performance data processing platforms that have had a transformational business impact on commercial, R&D, and operations organizations within life sciences.

Presentations

High performance enterprise data processing with Spark Session

We will talk about our experiences in building a very high performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.

Yousun Jeong is an IT manager at SK Telecom (SKT), South Korea’s largest wireless communications provider. Big data analysis at SKT mainly focuses on customer retention and customer experience improvements, which require sophisticated real-time data processing.

Presentations

Big Telco Real-Time Network Analytics Session

Data transfer is one of the most pressing problems for companies in the telecom industry today. As data requirements grow from month to month, the cost of handling that volume also rises sharply. Up to a certain point, large-scale data processing has been sufficient for SKT. In this presentation, Yousun Jeong details how SKT dealt with this problem.

Calvin Jia is the release manager for Alluxio and is a core maintainer of the project. He is also the top contributor to the Alluxio project and one of its earliest contributors. Calvin holds a BS from the University of California, Berkeley.

Presentations

Decoupling compute and storage with open source Alluxio Session

Calvin Jia and Haoyuan Li explain how to decouple compute and storage with Alluxio, exploring the decision factors, considerations, and production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform.

Xianyan is a software engineer at Intel, where she is responsible for developing deep learning and machine learning algorithms and pipelines. She is also a contributor to the BigDL (https://github.com/intel-analytics/BigDL/) project, a distributed deep learning framework on Apache Spark.

Presentations

Bringing deep learning into big data analytics using BigDL Session

Xianyan Jia and Zhenhua Wang explore deep learning applications built successfully with BigDL and teach you how to develop fast prototypes with BigDL's off-the-shelf deep learning toolkit and build end-to-end deep learning applications with flexibility and scalability using BigDL on Spark.

Melanie Johnston-Hollitt is an internationally prominent radio astronomer working in the space between astrophysics, computer science and big data, the director of astronomy and astrophysics at Victoria University of Wellington, and CEO of Peripety Scientific Ltd., an astrophysics and data analytics research company based in Wellington, New Zealand. Melanie serves as chair of the board of the $60 million Murchison Widefield Array (MWA) radio telescope and is a founding member of the board of directors of the Square Kilometre Array (SKA) Organisation Ltd., which is tasked with building the world’s largest radio telescope. In her nearly 20-year career, she has been involved in design, construction, and operation of several major radio telescopes, including the Low Frequency Array in the Netherlands, the MWA in Australia, and the SKA, which will be hosted in both Australia and South Africa. These instruments produce massive quantities of data, requiring new and disruptive technologies to allow value to be extracted from the data deluge. As a result, Melanie’s recent interests span the intersection between radio astronomy, signal processing, and big data analytics. She leads a multidisciplinary team in Wellington that is investigating how best to meet the science challenges of these next-generation instruments in the big data era.

Presentations

Keynote with Melanie Johnston-Hollitt Keynote

Keynote with Melanie Johnston-Hollitt

Dirk Jungnickel is a senior vice president heading the central Business Analytics and Big Data function of Emirates Integrated Telecommunications Company (du), which integrates data warehousing, big data platforms, BI tools, data governance, business intelligence, and advanced analytics capabilities. Following an academic career in theoretical physics, with more than seven years of postdoctoral research, he has spent 17 years in telecommunications. A seasoned telecommunications executive, Dirk has held a variety of roles in international firms, including senior IT and IT architecture roles, various program management and business intelligence positions, the head of corporate PMO, and an associate partner with a global management and strategy consulting firm.

Presentations

The Argument for Data Centralization: A Telecommunications Use Case Session

Dubai-based telco leader du discusses leveraging a centralized data lake to improve customer experience, create smart cities, address unexpected business challenges, and even enable data monetization. The session covers business outcomes, technical challenges, architectural considerations, platform requirements for IoT, and root cause analysis.

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. At narrativeVIZ Consulting, Amit uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Amit also teaches storytelling with data as guest faculty in executive courses at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 12 years of management consulting experience with AT Kearney in India, Booz & Company in Europe, and more recently for startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi and a PGDM (MBA) from IIM, Ahmedabad. Find more about him at Amitkaps.com.

Presentations

Interactive visualization for data science Tutorial

One of the challenges with traditional data visualizations is that they are static and bounded by limited physical/pixel space. Interaction allows us to move beyond this limitation by adding layers of interactivity. Bargava Subramanian and Amit Kapoor teach the art and science of creating interactive data visualizations.

Holden Karau is a transgender Canadian, an Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.

Presentations

Debugging Apache Spark Session

Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark, and more.

Extending Spark ML: Adding custom pipeline stages to Spark Session

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines and then looks at how to extend them with your own custom algorithms. Adding your own pipeline stages allows you to use Spark's meta-algorithms and existing ML tools.

Sunil Karkera is the CTO of the Digital Enterprise unit and the head of the Digital Reimagination Studio at Tata Consultancy Services, where he and his team of creative designers, engineers, and business strategists apply design thinking methodologies to fundamentally rethink business models, user experiences, and enabling technologies. Sunil has over 20 years’ experience in technology and design and has founded three startups in Silicon Valley, all of which underwent successful acquisitions. He is a trained engineer and a typographer. Sunil was part of the early enterprise data wave, creating eBusiness Anywhere, which was acquired by Siebel Systems in 1999. Sunil was part of the successful IPO of Sonicwall, which disrupted internet security technologies, where he was part of the engineering team that built network security technologies, including SSL acceleration, content filtering, high-performance packet filters, and a high-networking throughput operating system, SonicOS. Sunil was the vice president of business systems at Fox Interactive Media, responsible for engineering MySpace, American Idol, Twentieth Century Fox Studios, IGN, Rotten Tomatoes, and GameSpy. In 2007, he cofounded Registria to innovate in product registration and after-sales service and marketing. During this time, he architected the ecommerce and IoT backends for Nest, as well as created the type design used in the Nest Thermostat user interface. In 2014, he cofounded Nurture software, which focused on creating mobile apps backed with advanced machine-learning technologies in the service of better health and wellness for women and children. Sunil has been granted US patents in areas including caching, product registration, configuration management, and dynamic email processing. He holds a bachelor’s degree in computer science from Mangalore University in India.
He has attended advanced typography design and color design programs at the University of Zurich (Zürcher Hochschule der Künste) under Rudolf Barmettler as well as modern arts programs at the Stedelijk Museum, Amsterdam for pointillism and graffiti. Sunil is passionate about computer history and volunteers at the Computer History Museum in Mountain View, CA.

Presentations

Designing AI-enabled experiences Session

AI-enabled systems bring a different gamut of challenges for experience designers. Designing experiences with recommenders, conversational systems, and sentiment watchers can be challenging due to their probabilistic nature. But effectively designed experiences can reduce steps, improve users' self-efficacy, and enable magical experiences.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install and click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. Prior research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Leveraging Live Data to Realize the Smart Cities Vision Smart Cities

With the proliferation of IoT devices, the volume of Live Data has been growing by leaps and bounds. One of the key application domains where Live Data can be leveraged is Smart Cities. To this end, the availability of generic platforms that support high throughput and ultra-low latency is critical. In this talk, we walk through a concrete country-scale case study based on the Satori platform.

Jisung Kim is a software engineer at SK Telecom. Jisung has 12 years of experience working with code to build distributed data architectures and applications, integrating legacy systems and big data technologies. Recently, he has been focused on advanced analytics using scalable machine learning algorithms with big data.

Presentations

Big data: The best way to truly understand customers in Telco DCS

In the telecommunication industry, quality of service in networks is the customer’s top concern, but it is difficult to analyze due to the increasingly massive volume of data. Kyungtaak Noh and Jisung Kim offer their solution—a quality management system that integrates Hadoop and big data technology—and explain how they use it to efficiently visualize and utilize big data.

Markus Kirchberg is CEO of Wismut Labs Pte. Ltd., where he leads a team of diverse technical experts that help modernize and transform clients’ products, services, and operations. Markus has over 20 years’ experience in research and technology-driven innovation, and his career spans academia, dedicated research centers, and industrial research and incubation labs. Previously, Markus was the head of technology innovation at Deep Labs, where he was responsible for driving and delivering technology innovation across the Asia Pacific region; headed Visa Labs, Asia Pacific; served as an expert at HP Labs Singapore, where he led various innovation initiatives on next-generation, cross-domain data analytics platforms; worked as a research fellow and principal investigator at the Institute for Infocomm Research at A*STAR; and was a lecturer at Massey University, New Zealand. Markus’s skill set includes full innovation lifecycle management, automating infrastructure, cloud computing, data management at multipetabyte scale, data privacy, deep learning, emerging technologies, the internet of things, large-scale data analytics, and extreme transaction processing. He has extensive experience in healthcare, logistics, payments analytics and processing, risk management, and transportation.

Presentations

Payment fraud detection and prevention in the age of big data, network science, and AI Session

As the share of digital payments increases, so does payment fraud, which almost tripled between 2013 and 2016. Markus Kirchberg explains how recent advances in AI and machine learning, decision sciences, and network sciences are driving the development of next-generation payment fraud capabilities for fraud scoring, deceptive merchant detection, and merchant compromise detection.

Mike Koelemay runs the Data Science team within Advanced Analytics at Sikorsky, where he is responsible for bringing state-of-the-art analytics and algorithm technologies to support the ingestion, processing, and serving of data collected onboard thousands of aerospace assets around the world. Drawing on his 10+ years of experience in applied data analytics for integrated system health management technologies, Mike works with other software engineers, data architects, and data scientists to support the execution of advanced algorithms, data mining, signal processing, system optimization, and advanced diagnostics and prognostics technologies, with a focus on rapidly generating information from large, complex datasets.

Presentations

Where Data Science Meets Rocket Science: Data Platforms and Predictive Analytics for Aerospace DCS

Sikorsky collects data onboard thousands of helicopters deployed worldwide for use in fleet management services, engineering analyses, and business intelligence. This session presents the data platform that Sikorsky has built to manage the ingestion, processing, and serving of this data so that it can rapidly generate information to drive decision making.

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. Jared is the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, as well as an adjunct professor of statistics at Columbia University, in addition to his client-facing consulting and training. Jared specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world. He was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master’s degree in statistics from Columbia University and a bachelor’s degree in mathematics from Muhlenberg College.

Presentations

Machine Learning in R Tutorial

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today's incredible computing power. This course focuses on the available methods for implementing machine learning algorithms in R and examines some of the underlying theory behind the curtain, covering the Elastic Net, boosted trees, and cross-validation.

Making R Go Faster and Bigger Session

One common, but false, knock against R is that it doesn't scale well. This talk shows how to use R in a performant manner, in terms of both speed and data size, and covers packages for running R at scale.

Philip Langdale is the engineering lead for cloud at Cloudera. He joined the company as one of the first engineers building Cloudera Manager and served as an engineering lead for that project until moving to working on cloud products. Previously, Philip worked at VMware, developing various desktop virtualization technologies. Philip holds a bachelor’s degree with honors in electrical engineering from the University of Texas at Austin.

Presentations

A deep dive into running big data workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

Steve Leonard is the founding CEO of SGInnovate, a private limited company wholly owned by the Singapore government. Capitalizing on the science and technology research for which Singapore has gained a global reputation, Steve’s team works with local and international partners, including universities, venture capitalists, and major corporations, to help technical founders imagine, start, and scale globally relevant early-stage technology companies from Singapore. A technology-industry leader with a wide range of experience, Steve has played a key role in building several global companies in areas such as software, hardware, and services. Previously, he was the executive deputy chairman of the Infocomm Development Authority (IDA), a government statutory board under the purview of Singapore’s Ministry of Communications and Information, where he was responsible for various aspects of the information technology and telecommunications industries in Singapore on a national level. Steve serves on the advisory boards of a number of universities and organizations in Singapore and is an independent non-executive director of AsiaSat, a Hong Kong Stock Exchange-listed commercial operator of communication spacecraft. Although born in the US, Steve considers himself a member of the larger global community, having lived and worked outside the US for more than 25 years.

Presentations

Keynote with Steve Leonard Keynote

Keynote with Steve Leonard

Dong Li is a technical partner and senior software architect at Kyligence. Dong is also an Apache Kylin committer and PMC member and the tech lead for KyBot. Previously, Dong was a senior software engineer in the Analytics Data Infrastructure Department at eBay and a software development engineer in the Cloud and Enterprise Department at Microsoft, where he was a core member of the dynamics APAC team, responsible for developing next-generation cloud-based ERP solutions. Dong holds both a bachelor’s and master’s degree from Shanghai Jiao Tong University.

Presentations

Apache Kylin: Advanced tuning and best practices with KyBot Session

Apache Kylin is an extreme distributed OLAP engine on Hadoop. Well-tuned cubes bring about the best performance with the least cost but require a comprehensive understanding of tuning principles to use. Dong Li explains advanced tuning and introduces practices with KyBot, which helps find and solve bottlenecks in an intelligent way with AI methods performed on log analysis results.

Haoyuan Li is founder and CEO of Alluxio (formerly Tachyon Nexus), a memory-speed virtual distributed storage system. Before founding the company, Haoyuan was working on his PhD at UC Berkeley’s AMPLab, where he cocreated Alluxio. He is also a founding committer of Apache Spark. Previously, he worked at Conviva and Google. Haoyuan holds an MS from Cornell University and a BS from Peking University.

Presentations

Decoupling compute and storage with open source Alluxio Session

Calvin Jia and Haoyuan Li explain how to decouple compute and storage with Alluxio, exploring the decision factors, considerations, and production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform.

Simon Lidberg is a solution architect within Microsoft’s Data Insights Center of Excellence. He has worked with database and data warehousing solutions for almost 20 years in a variety of industries and has more recently focused on analysis, BI, and big data. Simon is the author of Getting Started with SQL Server 2012 Cube Development.

Presentations

The value of a data science center of excellence (COE) Session

As organizations turn to data-driven strategies, they are also increasingly exploring the creation of a data science or analytic center of excellence (COE). Benjamin Wright-Jones and Simon Lidberg outline the building blocks of a center of excellence and describe the value for organizations embarking on data-driven strategies.

Yu-Xi Lim is lead data scientist at Teralytics, where he leads the technical team in the company’s Singapore office. Yu-Xi is interested in applying data science to retail and travel. Previously, he led teams at Southeast Asian ecommerce giant Lazada and at TravelShark; was vice president of engineering at payment startup Fastacash; and was a software engineer in Microsoft’s Windows Division. Yu-Xi holds a PhD in electrical and computer engineering from Georgia Tech, where he did research on WiFi positioning systems.

Presentations

Distributed real-time highly available stream processing Session

Yu-Xi Lim outlines a high-throughput distributed software pattern capable of processing event streams in real time. At its core, the pattern relies on functional reactive programming idioms to shard and splice state fragments, ensuring high horizontal scalability, reliability, and high availability.

Rhea Liu is an analyst at China Tech Insights, an internet research unit affiliated with Tencent’s Online Media Group (which hosts products including QQ.com, Tencent News, and Tencent Video). Rhea focuses on China’s leading trends in the internet industry, especially online education and the application of new tools, such as artificial intelligence, in consumer products.

Presentations

Keynote with Rhea Liu Keynote

Keynote with Rhea Liu

Session with Rhea Liu Session

Rhea Liu, Analyst, China Tech Insights

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Managing successful data projects: Technology selection and team building DCS

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

Top five mistakes when writing streaming applications Session

Ted Malaska shares the top five mistakes that no one talks about when you start writing your streaming app along with the practices you'll inevitably need to learn along the way.

Bruce Martin is a senior instructor at Cloudera, where he teaches courses on data science, Apache Spark, Apache Hadoop, and data analysis. Previously, Bruce was principal architect and director of advanced concepts at SunGard Higher Education, where he developed the software architecture for SunGard’s Course Signals Early Intervention System, which uses machine learning algorithms to predict the success of students enrolled in university courses. Bruce’s other roles have included senior staff engineer at Sun Microsystems and researcher at Hewlett-Packard Laboratories. Bruce has written many papers on data management and distributed system technologies and frequently presents his work at academic and industrial conferences. Bruce has authored patents on distributed object technologies. Bruce holds a PhD and master’s degree in computer science from the University of California, San Diego, and a bachelor’s degree in computer science from the University of California, Berkeley.

Presentations

Cloudera big data architecture workshop 2-Day Training

Bruce Martin leads a training that brings together technical contributors in a group setting to design and architect solutions to a challenging business problem. You'll explore big data application architecture concepts in general and then apply them to the design of a challenging system.

Peng Meng is a senior software engineer on the big data and cloud team at Intel, where he focuses on Spark and MLlib optimization. Peng is interested in machine learning algorithm optimization and large-scale data processing. He holds a PhD from the University of Science and Technology of China.

Presentations

Apache Spark ML and MLlib tuning and optimization: A case study on boosting the performance of ALS by 60X Session

Apache Spark ML and MLlib are hugely popular in the big data ecosystem, and Intel has been deeply involved in Spark from a very early stage. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib ALS by 60x in JD.com’s production environment.

John Mertic is director of program management for ODPi and Open Mainframe Project at the Linux Foundation. John comes from a PHP and open source background. Previously, he was director of business development software alliances at Bitnami, a developer, evangelist, and partnership leader at SugarCRM, board member at OW2, president of OpenSocial, and a frequent conference speaker around the world. As an avid writer, John has published articles on IBM Developerworks, Apple Developer Connection, and PHP Architect and authored The Definitive Guide to SugarCRM: Better Business Applications and Building on SugarCRM.

Presentations

Big data on the rise: Views of emerging trends and predictions from real-life end users Session

John Mertic and Cupid Chan share real end-user perspectives from companies like GE on how they are using big data tools, challenges they face, and where they are looking to focus investments—all from a vendor-neutral viewpoint.

Harjinder Mistry is a member of the developer tools team at Red Hat, where he is incorporating data science into next-generation developer tools powered by Spark. Previously, he was a member of IBM’s analytics team, where he developed Spark ML Pipelines components for the IBM Analytics platform, and spent several years on the DB2 SQL Query Optimizer team building and fixing the mathematical model that decides the query execution plan. Harjinder holds an MTech from IIIT, Bangalore, India.

Presentations

A recommendation system for wide transactions Session

Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide and outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving it using Elasticsearch in the cloud.

Engineering cloud-native machine learning applications Session

In the current Agile business environment, where developers are required to experiment with multiple ideas and also react to various situations, doing cloud-native development is the way to go. Harjinder Mistry and Bargava Subramanian explain how to design and build a microservices-based cloud-native machine learning application.

Based in Singapore, David Mueller is the practice partner for advanced analytics for Teradata’s ASK region, where he leads an international team of data scientists who support customer projects across Southeast Asia, India, Pakistan, and South Korea. As subject-matter expert for artificial intelligence and deep learning at Teradata International, David is passionate about bringing the benefits of novel analytical approaches to the enterprise. David’s background is in digital customer and marketing analytics. Previously, he headed the data science team at a German ad tech company.

Presentations

Deep learning for recommender systems Tutorial

Tim Seears and David Mueller explain how to apply deep learning to improve consumer recommendations by training neural nets to learn categories of interest using embeddings and demonstrate how to extend this with WALS matrix factorization to achieve wide and deep learning—a process which is now used in production for the Google Play Store.

Prateek Nagaria is a data scientist for the Data Team. Prateek is an advanced analytics expert with more than five years of experience. He specializes in business analytics, big data technologies, and statistical modeling as well as programming languages like R, Python, Java, C, and C++. Prateek holds a master’s degree in enterprise business analytics from the National University of Singapore and a bachelor’s degree in computer science and engineering.

Presentations

Forecasting intermittent demand: Traditional smoothing approaches versus the Croston method Session

Most data scientists use traditional methods of forecasting, such as exponential smoothing or ARIMA, to forecast product demand. However, when the product experiences several periods of zero demand, approaches such as Croston may provide better accuracy than these traditional methods. Prateek Nagaria compares traditional and Croston methods in R on intermittent demand time series.

Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

AI within O'Reilly Media Session

Paco Nathan explains how O'Reilly employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video.

Daniel Ng is senior director for APAC at Cloudera. An end-in-mind strategist, Daniel champions business and technology values for customers, from SMBs to enterprises. Over his 31 years in the ICT industry, he has covered the APAC region in marketing, business development, and sales roles for companies like IBM, Microsoft, and Red Hat, after starting his career as a software engineer at Nixdorf. He led Sun Microsystems into becoming the first Multimedia Super Corridor (MSC) status company and has led efforts in establishing ebusiness for IBM, open source for Red Hat, and now big data analytics for Cloudera. Daniel was also a winning mentor for the 2015 Lee Kuan Yew Global Business Plan Competition and is an advisor for numerous startups in Singapore and globally.

Presentations

Progressing Data Science through Talent Development Session

As more companies adopt a data-driven culture, the technology that supports it can only be put to use if data professionals are available. In APAC, as in the rest of the world, this talent pool is small, and with the growing adoption of big data and machine learning, the shortage has become a hindering factor.

Kyungtaak Noh is a software manager at SK Telecom. A software architect and full stack web application developer, Kyungtaak has been working in the area of visualizing and coordinating with big data applications.

Presentations

Big data: The best way to truly understand customers in Telco DCS

In the telecommunication industry, quality of service in networks is the customer’s top concern, but it is difficult to analyze due to the increasingly massive volume of data. Kyungtaak Noh and Jisung Kim offer their solution—a quality management system that integrates Hadoop and big data technology—and explain how they use it to efficiently visualize and utilize big data.

Supreet is a technology executive with a passion for building products and solutions for real-time, distributed, and big data analytical applications. With over 20 years of experience, Supreet has worked in technical and leadership roles at Oracle Corporation, Concurrent Inc., American Express, Real-Time Innovations, Agile, Microsoft, and many privately held Silicon Valley companies.

He is also the lead mentor for StartX, a Stanford student startup accelerator designed and developed to provide a place for the top Stanford founders. Supreet received his BS in computer sciences with highest honors from the University of Texas at Austin and his MS in computer sciences from Stanford University. He is widely published in industry and often presents at conferences. In his free time, Supreet is reconnecting with his old passion for painting.

Presentations

Querying Time-Series patterns with SAX Session

Time series data is any data set that is plotted over a range of time. Often, in IoT use cases, what is of interest is the sequence of measurements, or a pattern. However, such queries on data patterns do not traditionally scale. In this talk, we will discuss how at Oracle we adapted and extended Symbolic Aggregate Approximation (SAX) to solve such challenges.

Jean-Baptiste Onofré is a fellow and software architect at cloud and big data integration software company Talend. An ASF member and contributor to roughly 20 different Apache projects, Jean-Baptiste specializes in both system integration and big data. He is also a champion and PPMC on multiple Apache Beam projects.

Presentations

How Apache Beam can advance your enterprise workloads Session

Apache Beam allows data pipelines to work in batch, streaming, and a variety of open source and private cloud data processing backends, including Apache Flink, Apache Spark, and Google Cloud Dataflow. Jean-Baptiste Onofré offers an overview of Apache Beam's programming model, explores mechanisms for efficiently building data pipelines, and demos an IoT use case dealing with MQTT messages.

Francois Orsini is the chief technology officer for MZ’s Satori business unit. Previously, he served as vice president of platform engineering and chief architect, bringing his expertise in building server-side architecture and implementation for a next-gen social and server platform; was a database architect and evangelist at Sun Microsystems; and worked in OLTP database systems, middleware, and real-time infrastructure development at companies like Oracle, Sybase, and Cloudscape. Francois has extensive experience working with database and infrastructure development, honing his expertise in distributed data management systems, scalability, security, resource management, HA cluster solutions, soft real-time and connectivity services. He also collaborated with Visa International and Visa USA to implement the first Visa Cash Virtual ATM for the internet and founded a VC-backed startup called Unikala in 1999. Francois holds a bachelor’s degree in civil engineering and computer sciences from the Paris Institute of Technology.

Presentations

Leveraging Live Data to Realize the Smart Cities Vision Smart Cities

With the proliferation of IoT devices, the volume of Live Data has been growing by leaps and bounds. One of the key application domains where Live Data can be leveraged is Smart Cities. To this end, the availability of generic platforms that support high throughput and ultra-low latency is critical. In this talk, we walk through a concrete country-scale case study based on the Satori platform.

Clifton Phua is a director at NCS Group, leading a team of data scientists working on artificial intelligence, machine learning, and advanced analytics under the Smart and Safe City Centre of Excellence. Previously, Clifton worked at SAS Institute Pte Ltd on SAS analytics, where he specialized in big data analytics in public security (attack and disaster preparation, recovery, and response; cybersecurity; internal security; and predictive policing) and fraud (government, banking, and insurance); was a data scientist at the Data Analytics Department (formerly known as the Data Mining Department) at the Institute for Infocomm Research (I2R) at the Agency for Science, Technology, and Research (A*STAR) in Singapore, where he focused on web monitoring of companies and technologies, assistive technology for people with dementia, and mobile phone activity recognition and worked on real-world energy-related analytics projects to improve parts of the smart grid and other big data applications; and was a research fellow at the Data Mining Laboratory within the Department of Industrial Engineering at Seoul National University, South Korea. Clifton holds a PhD in identity crime detection and a bachelor’s degree with first-class honors, both from Monash University, Australia.

Presentations

Advanced analytics for a safe city Smart Cities

Clifton Phua offers an overview of several key applications of advanced analytics related to public safety, illustrating the potential value and insights that advanced analytics can bring to a safe city.

Michael Prorock is founder and CTO at mesur.io. Michael is an expert in systems and analytics, as well as in building teams that deliver results. Previously, he was director of emerging technologies for the Bardess Group, where he defined and implemented a technology strategy that enabled Bardess to scale its business to new verticals across a variety of clients, and worked in analytics for Raytheon, Cisco, and IBM, among others. He has filed multiple patents related to heuristics, media analysis, and speech recognition. In his spare time, Michael applies his findings and environmentally conscious methods on his small farm.

Presentations

Smart agriculture: Blending IoT sensor data with visual analytics on Apache Hive and Spark DCS

Mike Prorock and Hugo Sheng offer an overview of mesur.io, a game-changing climate awareness solution that utilizes Apache Hive, Spark, ESRI, and Qlik. Mesur.io combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market.

Xie Qi is a senior software engineer on the big data engineering team at Intel China, where he works on Spark optimization for Intel platforms. Xie has broad experience across big data, multimedia, and wireless.

Presentations

FPGA-based acceleration architecture for Spark SQL Session

Xie Qi and Quanfu Wang offer an overview of a configurable FPGA-based Spark SQL acceleration architecture that leverages FPGAs' very high parallel computing capability to tremendously accelerate Spark SQL queries and FPGAs' power efficiency to lower power consumption.

Yanyu Qu is a data engineer on Grab’s data engineering team, where he works on Spark and Presto’s data gateway. Previously, he worked at FunPlus, App Annie, IBM, and Teradata.

Presentations

Operationalizing Presto in the cloud: Lessons and mistakes Session

Grab uses Presto to support operational reporting (batch and near real-time), ad hoc analyses, and its data pipeline. Currently, Grab has 5+ clusters with 100+ instances in production on AWS and serves up to 30K queries per day while supporting more than 200 internal data users. Feng Cheng and Yanyu Qu explain how Grab operationalizes Presto in the cloud and share lessons learned along the way.

Kira Radinsky is the chief scientist and director of data science at eBay, where she is building the next-generation predictive data mining, deep learning, and natural language processing solutions that will transform ecommerce. She also serves as a visiting professor at the Technion, Israel’s leading science and technology institute, where she focuses on the application of predictive data mining in medicine. Kira cofounded SalesPredict (acquired by eBay in 2016), a leader in the field of predictive marketing whose solutions leveraged large-scale data mining to predict sales conversions. One of the up-and-coming voices in the data science community, Kira is pioneering the field of web dynamics and temporal information retrieval. She gained international recognition for her work at Microsoft Research, where she developed predictive algorithms that recognized the early warning signs of globally impactful events, including political riots and disease epidemics. She was named one of MIT Technology Review’s 35 young innovators under 35 for 2013 and one of Forbes’s 30 under 30 rising stars in enterprise technology for 2015; in 2016, she was recognized as woman of the year by Globes. Kira is a frequent presenter at global tech events, including TEDx and the World Wide Web Conference, and she has published in Harvard Business Review.

Presentations

Mining Electronic Health Records and the Web for Drug Repurposing Keynote

We jointly harness large-scale electronic health records and feasible conceptual links among concepts drawn from Wikipedia to provide guidance about drug repurposing -- the process of applying known drugs in new ways to treat diseases. We claim that researchers decide on exploratory targets for repurposing based on trends in research and observations on small numbers of cases, leading to...

Syed Rafice is a senior system engineer at Cloudera, where he specializes in big data on Hadoop technologies and is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed also focuses on both platform and cybersecurity. He has worked across multiple sectors, including government, telecoms, media, utilities, financial services, and transport.

Presentations

Smart Cities, The Smart Grid, IoT, and Big Data Smart Cities

Smart Cities and the electricity smart grid have rapidly become leading examples of IoT: distributed sensors describe mission-critical behaviour by generating billions of metrics daily. Learn how smart utilities and cities rely on Hadoop to capture, analyze, and harness this data to increase safety, availability, and efficiency across the entire electricity grid.

Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over his 20-year career, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently, product management, to provide a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Rethinking data marts in the cloud: Common architectural patterns for analytics Session

Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.

Isaac Reyes is a principal at DataSeer, where he leads a team that delivers in-house training courses in data storytelling, predictive analytics, and machine learning to companies such as Cisco, Ericsson, Hewlett Packard, and Pfizer. A data scientist, trainer, and TEDx speaker who lives, breathes, and dreams data, Isaac previously lectured in statistical theory at the Australian National University and worked as a data scientist in the private sector.

Presentations

Data driven: Visualizing vehicle journey data for SE Asia’s leading ride-hailing company DCS

Like snowflakes, no two taxi rides are the same. Rides vary by origin, destination, and duration, among hundreds of other dimensions. Isaac Reyes explains why Grab—the ride-hailing app beating Uber in SE Asia—turned to DataSeer to help create rich data visualizations of their customer data when the company wanted to better understand the behavior of its customers.

The art of data storytelling Session

Isaac Reyes explores the art and science of data storytelling, covering the essential elements of a good data story, chart design and why it matters, the Gestalt principles of visual perception and how they can be used to tell better stories with data, and how to make over a poor visualization.

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.

Presentations

Rethinking data marts in the cloud: Common architectural patterns for analytics Session

Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.

Steve Ross is the director of product management at Cloudera, where he focuses on security across the big data ecosystem, balancing the interests of citizens, data scientists, and IT teams working to get the most out of their data while preserving privacy and complying with the demands of information security and regulations. Previously, at RSA Security and Voltage Security, Steve managed product portfolios now in use by the largest global companies and hundreds of millions of users.

Presentations

GDPR: Getting Your Data Ready for Heavy New EU Privacy Regulations Session

The General Data Protection Regulation (GDPR) will go into effect in 2018 for firms doing any business in the EU. However, many companies aren't prepared for the strict regulation or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify compliance with GDPR, as well as with future regulations.

Nikki Rouda is the cloud and core platform director at Cloudera. Nik has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their IT challenges. His career spans big data, analytics, machine learning, AI, storage, networking, security, and the IoT. Nik holds an MBA from Cambridge and an ScB in geophysics and math from Brown.

Presentations

Good everywhere: Managing security and governance in a hybrid- and multicloud world Session

Managing the security and governance of big data can be challenging on-premises but becomes far more difficult in a heterogeneous environment spanning a public cloud or across multiple cloud services. Nikki Rouda and Kelly Schupp share unbiased best practices to ensure your data is under control everywhere.

Kostas Sakellis is the lead and engineering manager of the Apache Spark team at Cloudera. Kostas holds a bachelor’s degree in computer science from the University of Waterloo, Canada.

Presentations

How to successfully run data pipelines in the cloud Session

With its scalable data store, elastic compute, and pay-as-you-go cost model, cloud infrastructure is well-suited for large-scale data engineering workloads. Kostas Sakellis explores the latest cloud technologies, focusing on data engineering workloads, cost, security, and ease-of-use implications for data engineers.

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists, particularly focusing on the Apache Spark ecosystem. Previously, he worked at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka. Neelesh holds a master’s degree in computer science with a focus on cloud computing from North Carolina State University and a bachelor’s degree in computer engineering from the University of Mumbai, India.

Presentations

Apache Spark in the hands of data scientists Session

Neelesh Srinivas Salian offers an overview of the data platform used by data scientists at Stitch Fix, based on the Spark ecosystem. Neelesh explains the development process and shares some lessons learned along the way.

Kaz Sato is a staff developer advocate on the Cloud Platform team at Google, where he focuses on machine learning and data analytics products, such as TensorFlow, Cloud ML, and BigQuery. Kaz has been invited to major events, including Google Cloud Next ’17 SF, Google I/O 2016 and 2017, Strata+Hadoop World London 2017, San Jose and NYC 2016, Hadoop Summit 2016, and ODSC East 2016 and 2017. He has also been leading and supporting developer communities for Google Cloud for over eight years. Kaz is interested in hardware and the IoT and has been hosting FPGA meetups since 2013.

Presentations

BigQuery and TensorFlow: Data Warehouse + Machine Learning enables the "smart" query Session

BigQuery is Google's fully managed, petabyte-scale data warehouse. Its user-defined functions enable "smart" queries powered by machine learning, such as similarity search or recommendations on images or documents using feature vectors and neural network prediction. This session shows how BigQuery and TensorFlow together enable a powerful "data warehouse + ML" solution.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.

Machine learning with TensorFlow (Day 2) Training Day 2

Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data.

Kelly Schupp is vice president of marketing at Zaloni and serves as Zaloni’s brand steward. Kelly is deeply passionate about the impact of data-driven marketing. She has 20 years of experience in the enterprise software and technology industry and has held a variety of global marketing leadership roles. Previously, Kelly worked at IBM, Micromuse, and Porter Novelli.

Presentations

Good everywhere: Managing security and governance in a hybrid- and multicloud world Session

Managing the security and governance of big data can be challenging on-premises but becomes far more difficult in a heterogeneous environment spanning a public cloud or across multiple cloud services. Nikki Rouda and Kelly Schupp share unbiased best practices to ensure your data is under control everywhere.

Tim Seears is area practice director for Asia-Pacific at Think Big, a Teradata company. Previously, he was CTO of Big Data Partnership (acquired by Teradata in 2016), which he cofounded after a career spent in the space industry working on NASA’s Cassini orbiter mission at Saturn. Tim and his team established Big Data Partnership as a dominant thought leader throughout the European market, providing data science, data engineering, and big data architecture services to global enterprise customers.

Presentations

Deep learning for recommender systems Tutorial

Tim Seears and David Mueller explain how to apply deep learning to improve consumer recommendations by training neural nets to learn categories of interest using embeddings and demonstrate how to extend this with WALS matrix factorization to achieve wide and deep learning—a process which is now used in production for the Google Play Store.

Jonathan Seidman is a software engineer on the partner engineering team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Managing successful data projects: Technology selection and team building DCS

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

Tushar Shanbhag is head of data strategy and data products at LinkedIn. Tushar is a seasoned executive with a track record of building high-growth businesses at market-defining companies such as LinkedIn, Cloudera, VMware, and Microsoft. Most recently, Tushar was vice president of products and design at Arimo, an Andreessen-Horowitz company building data intelligence products using analytics and AI.

Presentations

Privacy by design, not an afterthought: Best practices at LinkedIn Session

LinkedIn houses the most valuable professional data in the world. Protecting the privacy of member data has always been paramount. Shirshanka Das and Tushar Shanbhag outline three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement framework, and a unified data access layer.

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions. With previous experience in technology leadership positions for NetApp, Fujitsu, and others, Ben’s expertise ranges from development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization, and storage. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes, and he holds two patents.

Presentations

The Argument for Data Centralization: A Telecommunications Use Case Session

Dubai-based telco leader du discusses leveraging a centralized data lake to improve customer experience, create smart cities, address unexpected business challenges, and even enable data monetization. The session covers business outcomes, technical challenges, architectural considerations, platform requirements for IoT, and root cause analysis.

Ofir Sharony is a senior member of MyHeritage’s backend team, where he is currently focused on building pipelines on-premises and in the cloud using batch and streaming technologies. An expert in building data pipelines, Ofir acquired most of his experience planning and developing scalable server-side solutions.

Presentations

From Kafka to BigQuery: A guide for delivering billions of daily events Session

What are the most important considerations for shipping billions of daily events to analysis? Ofir Sharony shares MyHeritage's journey to find a reliable and efficient way to achieve real-time analytics. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline.

Hugo Sheng leads Qlik’s partner engineering organization, which is responsible for both developer relations and the integration of global technology partner solutions with the Qlik platform. Hugo has spent over 20 years in data management and analytics, specializing in highly scalable big data solutions. Over his career, he has held senior roles at Torrent Systems, Ascential Software, and IBM and led sales engineering and services at Expressor Software (acquired by Qlik in 2012). Hugo also spent several years in software development in the medical device industry. Hugo holds a BS in electrical engineering from the University of Houston and an MBA from the Jones Graduate School of Business at Rice University.

Presentations

Smart agriculture: Blending IoT sensor data with visual analytics on Apache Hive and Spark DCS

Mike Prorock and Hugo Sheng offer an overview of mesur.io, a game-changing climate awareness solution that utilizes Apache Hive, Spark, ESRI, and Qlik. Mesur.io combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market.

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Unravelling data at scale with Spark using deep learning and other algorithms from machine learning. Tutorial

We walk you through the machine-learning algorithms available in Spark ML for understanding and deciphering meaningful patterns in real-world data. Along with discussing the common problems encountered as data and model sizes scale, we also use a few open source deep learning frameworks to run classification problems on image and text data sets with Spark.

Fahd Siddiqui is a software engineer at Cloudera, where he’s working on cloud products, such as Cloudera Altus and Cloudera Director. Previously, Fahd worked at Bazaarvoice developing EmoDB, an open source data store built on top of Cassandra. His interests include highly scalable and distributed systems. He holds a master’s degree in computer engineering from the University of Texas at Austin.

Presentations

A deep dive into running big data workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

Vartika Singh is a field data science solutions architect at Cloudera with over 12 years of experience applying machine-learning techniques to big data problems.

Presentations

Unravelling data at scale with Spark using deep learning and other algorithms from machine learning. Tutorial

We walk you through the machine-learning algorithms available in Spark ML for understanding and deciphering meaningful patterns in real-world data. Along with discussing the common problems encountered as data and model sizes scale, we also use a few open source deep learning frameworks to run classification problems on image and text data sets with Spark.

Bargava Subramanian is a machine learning engineer based in Bangalore, India. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. He mentors early-stage startups in their data science journey. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

A recommendation system for wide transactions Session

Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide and outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving it using Elasticsearch in the cloud.

Engineering cloud-native machine learning applications Session

In the current Agile business environment, where developers are required to experiment with multiple ideas and react to various situations, cloud-native development is the way to go. Harjinder Mistry and Bargava Subramanian explain how to design and build a microservices-based cloud-native machine learning application.

Interactive visualization for data science Tutorial

One of the challenges of traditional data visualizations is that they are static and bounded by limited physical/pixel space. Interaction allows us to move beyond this limitation by adding layers of interactivity. Bargava Subramanian and Amit Kapoor teach the art and science of creating interactive data visualizations.

Tzu-Li (Gordon) Tai is a software engineer at data Artisans and a committer for Apache Flink, where his contributions include work on Flink’s streaming connectors (Kafka, AWS Kinesis, Elasticsearch) and its type serialization stack and state management capabilities. Gordon is a frequent speaker at conferences such as Flink Forward, Flink meetups in Berlin and Taiwan, and several Taiwan-based conferences on the Hadoop ecosystem and data engineering in general.

Presentations

The stream processor as a database: Building event-driven applications with Apache Flink Session

Apache Flink is evolving from a framework for streaming data analytics into a platform that offers a foundation for event-driven applications, replacing the data management aspects that are typically handled by a database in more conventional architectures. Tzu-Li (Gordon) Tai explores the key features that are powering Flink's evolution, along with demonstrations of them in action.

Grace Tang leads experimental research advising for the APAC region at Uber. Previously, Grace was lead data scientist at online real estate platform 99.co. She holds a PhD in neuroscience from Stanford University, where she worked in the Decision Neuroscience Lab, studying how personality traits, emotions, and external stimuli affect decision making.

Presentations

Turning fails into wins Session

Being a data-driven company means that we have to move fast and fail often. But how do we learn to not only be proud of our failures but also turn these fails into wins? Grace Tang explains how to set up experiments so that negative results become epic wins, saving your team time, effort, and money instead of just being swept under the carpet.

Eric Tham is an associate lecturer at the National University of Singapore. Previously, he was an enterprise data scientist at Thomson Reuters, led the quantitative data science team in a Chinese fintech startup with five million users, and worked in the financial industry in risk management, quantitative development, and energy economics with banks and oil companies. Over his career, he has developed sentiment indices from social media data and is an expert in unstructured data analysis, NLP, and machine learning in financial applications. He is a frequent speaker at conferences and contributed a chapter to the Handbook of Sentiment Analysis in Finance.

Presentations

Practical applications for graph techniques in supply chain analysis and finance Session

Graphical techniques are increasingly being used for big data. These techniques can be broadly classified into the three C's: centrality, clustering, and connectedness. Eric Tham explains how these concepts are applied to supply chain analysis and financial portfolio management.

I’ve been working in data engineering for eight years, in Python, Bash, Ruby, C++, and Java. I have been using Hadoop commercially since 2011 and have built analytics and batch processing systems as well as data preparation tools for machine learning.

Presentations

The Trials of Machine Learning at Zendesk Session

Building a successful machine learning product is extremely challenging. It is easy to assume that building the model is most of the work. However, just as much effort is needed to turn that model into a customer-facing product. We'll delve into the various design challenges and real-world problems that arise when building a machine learning product at scale.

Wee Hyong Tok is a principal data science manager for the cloud AI team at Microsoft, where he works with teams to cocreate new value and turn each of the challenges facing organizations into compelling data stories that can be concretely realized using proven enterprise architecture. Wee Hyong has worn many hats in his career, including developer, program/product manager, data scientist, researcher, and strategist, and his range of experience has given him unique superpowers to nurture and grow high-performing innovation teams that enable organizations to embark on their data-driven digital transformations using artificial intelligence. He has a passion for leading artificial intelligence-driven innovations and working with teams to envision how these innovations can create new competitive advantage and value for their business, and he strongly believes in story-driven innovation. He coauthored one of the first books on Azure Machine Learning, Predictive Analytics Using Azure Machine Learning, and authored another demonstrating how database professionals can do AI with databases, Doing Data Science with SQL Server.

Presentations

Bootstrap custom image classification using transfer learning Session

Transfer learning enables you to use pretrained deep neural networks (e.g., AlexNet, ResNet, and Inception V3) and adapt them for custom image classification tasks. Danielle Dean and Wee Hyong Tok walk you through the basics of transfer learning and demonstrate how you can use the technique to bootstrap the building of custom image classifiers.

Training and scoring deep neural networks in the cloud Session

Deep neural networks are responsible for many advances in natural language processing, computer vision, speech recognition, and forecasting. Danielle Dean and Wee Hyong Tok illustrate how cloud computing has been leveraged for exploration, programmatic training, real-time scoring, and batch scoring of deep learning models for projects in healthcare, manufacturing, and utilities.

Teresa Tung is a Managing Director at Accenture Technology Labs, where she is responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s industry platforms like Accenture Digital Connected Products and Accenture Analytics Insights Platform. Teresa holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

DevOps for models: How to manage millions of models in production—and at the Edge Session

As Accenture scaled to millions of predictive models, it required automation to ensure accuracy, prevent false alarms, and preserve trust. Teresa Tung, Ishmeet Grewal, and Jürgen Weichenberger explain how Accenture implemented a DevOps process for analytical models that's akin to software development, guaranteeing analytics modeling at scale and even in non-cloud environments at the edge.

Executive Briefing: Becoming a Data Driven Enterprise – A Maturity Model Session

A data-driven enterprise maximizes the value of its data. But how do enterprises emerging from technology and organization silos get there? We use our experience helping our clients through this journey to create a data-driven enterprise maturity model that spans technology and business requirements. We will walk through use cases that bring the model to life.

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

A deep dive into running big data workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

Sandeep is a principal at ZS Associates and heads ZS’ big data practice. He has been helping enterprises build cutting-edge technology solutions for over 17 years. He is the chief architect and technology leader focused on big data and has helped clients shape their vision, define roadmaps, and deliver large-scale enterprise platforms.

Presentations

High performance enterprise data processing with Spark Session

We will talk about our experiences in building a very high performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.

Arun Veettil currently works as a consultant, helping companies build custom-made data science, machine learning, and NLP solutions in the cloud. For the last seven years, Arun has been working at the intersection of machine learning and product development, helping companies develop intelligent data products. Previously, Arun worked at Starbucks, Point Inside, Nordstrom Advanced Analytics, the Walt Disney Company, and IBM. His expertise includes developing machine-learning algorithms to run against very large amounts of data and building large-scale distributed applications. Arun holds a master’s degree in computer science from the University of Washington and a bachelor’s degree in electronics engineering from National Institute of Technology, Allahabad, India.

Presentations

Architecting a text analytics system in the cloud Session

The speaker shares his experience and lessons learned from developing a customized, enterprise-level NLP platform that helped his client replace a leading text analytics vendor platform.

Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He is an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.

Presentations

An adaptive execution mode for Spark SQL Session

Spark SQL is one of the most popular components of Apache Spark. Carson Wang and Yucai Yu explore Intel's efforts to improve SQL performance and offer an overview of an adaptive execution mode they implemented for Spark SQL.

Jason is a software engineer at Cloudera, focusing on the cloud.

Presentations

A deep dive into running big data workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

Quanfu Wang is a senior architect on Intel’s big data team, where he is working on software optimization and acceleration on information architecture and heterogeneous computing. Previously, Quanfu was a lead software engineer at Alcatel-Lucent, where he worked for the company’s wireline business group.

Presentations

FPGA-based acceleration architecture for Spark SQL Session

Xie Qi and Quanfu Wang offer an overview of a configurable FPGA-based Spark SQL acceleration architecture that leverages FPGAs' very high parallel computing capability to tremendously accelerate Spark SQL queries and FPGAs' power efficiency to lower power consumption.

Zhenhua is a software engineer on an AI and big data team, where he works on algorithm research and development in machine learning and computer vision, focusing on image features and representations, large-scale image deduplication, and image search.

Presentations

Bringing deep learning into big data analytics using BigDL Session

Xianyan Jia and Zhenhua Wang explore deep learning applications built successfully with BigDL and teach you how to develop fast prototypes with BigDL's off-the-shelf deep learning toolkit and build end-to-end deep learning applications with flexibility and scalability using BigDL on Spark.

Jürgen Weichenberger is a data science senior principal at Accenture Analytics, where he is currently working within resources industries with interests in smart grids and power, digital plant engineering, and optimization for upstream industries and the water industry. Jürgen has over 15 years of experience in engineering consulting, data science, big data, and digital change. In his spare time, he enjoys spending time with his family and playing golf and tennis. Jürgen holds a master’s degree (with first-class honors) in applied computer science and bioinformatics from the University of Salzburg.

Presentations

DevOps for models: How to manage millions of models in production—and at the Edge Session

As Accenture scaled to millions of predictive models, it required automation to ensure accuracy, prevent false alarms, and preserve trust. Teresa Tung, Ishmeet Grewal, and Jürgen Weichenberger explain how Accenture implemented a DevOps process for analytical models that's akin to software development, guaranteeing analytics modeling at scale and even in non-cloud environments at the edge.

Edd Wilder-James is a strategist at Google, where he is helping build a strong and vital open source community around TensorFlow. A technology analyst, writer, and entrepreneur based in California, Edd previously helped transform businesses with data as vice president of strategy for Silicon Valley Data Science. Formerly Edd Dumbill, Edd was the founding program chair for the O’Reilly Strata conferences and chaired the Open Source Convention for six years. He was also the founding editor of the peer-reviewed journal Big Data. A startup veteran, Edd was the founder and creator of the Expectnation conference-management system and a cofounder of the Pharmalicensing.com online intellectual-property exchange. An advocate and contributor to open source software, Edd has contributed to various projects such as Debian and GNOME and created the DOAP vocabulary for describing software projects. Edd has written four books, including O’Reilly’s Learning Rails.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data, AI, and data science have great potential for accelerating business, but how do you reconcile business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. John Akred and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.

Executive Briefing: Preparing your infrastructure for AI Session

AI techniques depend on massive amounts of data and powerful infrastructure, so understanding the demands of these techniques is essential to planning investments in software, computing resources, and personnel. Edd Wilder-James presents a road map for executives who are beginning to consider their strategies for implementing artificial intelligence in their critical processes.

Executive Briefing: The business case for AI, Spark, and friends Session

AI is white-hot at the moment, but where can it really be used? Developers are usually the first to understand why some technologies cause more excitement than others. Edd Wilder-James relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Benjamin Wright-Jones is a solution architect in the Microsoft WW Services CTO Office for Data and AI, where his team helps enterprise customers solve their analytical challenges. Over his career, Ben has worked on some of the largest and most complex data-centric projects around the globe.

Presentations

The value of a data science center of excellence (COE) Session

As organizations turn to data-driven strategies, they are also increasingly exploring the creation of a data science or analytic center of excellence (COE). Benjamin Wright-Jones and Simon Lidberg outline the building blocks of a center of excellence and describe the value for organizations embarking on data-driven strategies.

Mingxi Wu is the vice president of engineering at TigerGraph, a Silicon Valley startup building a world-leading real-time graph data platform. During his 15-year career, Mingxi has focused on database research and building data management software; his recent interests are in building an easy-to-use and highly expressive graph query language. Previously, he worked in Microsoft’s SQL Server Group and Oracle’s Relational Database Optimizer Group. He has won research awards from the most prestigious publications in database and data mining, including SIGMOD, KDD, and VLDB. Mingxi holds a PhD from the University of Florida, where he specialized in databases and data mining.

Presentations

TigerGraph: A complete high-performance graph data and analytics platform Session

Mingxi Wu and Yu Xu offer an overview of TigerGraph, a high-performance enterprise graph data platform that enables businesses to transform structured, semistructured, and unstructured data and massive enterprise data silos into an intelligent interconnected data network and uncover the implicit patterns and critical insights to drive business growth.

Xiaochang Wu is a senior software engineer on Intel’s big data engineering team, where he helps deliver the best Spark performance on Intel platforms. Xiaochang has more than 10 years’ experience in performance optimization for Intel architecture. He holds a master’s degree in computer science from Xiamen University of China.

Presentations

Spark Structured Streaming helps smart manufacturing Session

Xiaochang Wu explains how to design and implement a real-time processing platform using the Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry.

YongLiang Xu is the lead data architect for SmartHub, the analytics division of StarHub, where he is responsible for transforming and architecting the next generation of big data architecture. His work includes reengineering SmartHub’s big data platform for real-time processing to support real-time machine learning and experimenting with new Apache projects and optimizing the big data platform for streamlined and seamless performance. Previously, YongLiang was a software engineer at DSO National Laboratories, Singapore, where he developed solutions based on big data technologies.

Presentations

Fusing a deep learning platform with a big data platform Session

SmartHub and NTT DATA have embarked on a partnership to design next-generation architecture to power the data products that will help generate new insights. YongLiang Xu and Masaru Dobashi explain how deep learning and other analytics models coexist within the same platform to address issues relating to smart cities.

Yu Xu is the founder and CEO of TigerGraph, the world’s first native parallel graph database. He is an expert in big data and parallel database systems and has over 26 patents in parallel data management and optimization. Previously, Yu worked on Twitter’s data infrastructure for massive data analytics and was Teradata’s Hadoop architect leading the company’s big data initiatives. Yu holds a PhD in computer science and engineering from the University of California San Diego.

Presentations

TigerGraph: A complete high-performance graph data and analytics platform Session

Mingxi Wu and Yu Xu offer an overview of TigerGraph, a high-performance enterprise graph data platform that enables businesses to transform structured, semistructured, and unstructured data and massive enterprise data silos into an intelligent interconnected data network and uncover the implicit patterns and critical insights to drive business growth.

I am a senior data engineer at Zendesk, where I work on building machine learning products. I have more than eight years’ experience in data processing, APIs, and system integration since graduating with a PhD in computer vision.

Presentations

The Trials of Machine Learning at Zendesk Session

Building a successful machine learning product is extremely challenging. It is easy to assume that building the model is most of the work. However, just as much effort is needed to turn that model into a customer-facing product. We'll delve into the various design challenges and real-world problems that arise when building a machine learning product at scale.

Wataru Yukawa is a data engineer at LINE, where he is creating and maintaining a log analysis platform based on Hadoop, Hive, Fluentd, Presto, and Azkaban and working on aggregating log and RDBMS data with Hive and reporting using BI tools.

Presentations

LINE's log analysis platform Session

Data is a very important asset to LINE, one of the most popular messaging applications in Asia. Wataru Yukawa explains how LINE gets the most out of its data using a Hadoop data lake and an in-house log analysis platform.

Yucai Yu is a software architect at Intel, where he works on Apache Spark upstream development and IA optimization. Previously, he worked at IBM and Citi Bank with a focus on OS, virtualization, storage, and data warehouses.

Presentations

An adaptive execution mode for Spark SQL Session

Spark SQL is one of the most popular components of Apache Spark. Carson Wang and Yucai Yu explore Intel's efforts to improve SQL performance and offer an overview of an adaptive execution mode they implemented for Spark SQL.