Presented By O'Reilly and Cloudera
December 5-6, 2016: Training
December 6–8, 2016: Tutorials & Conference
Singapore

Strata + Hadoop World 2016 Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.

Maojin Jiang is an instructor at Cloudera. Previously, Maojin worked as a big data engineer, a software engineer, a DevOps developer, a system administrator, and a researcher with interests in topic-sentiment analysis, information retrieval, web mining, political text analysis, machine learning, and natural language processing. In early 2012, Maojin introduced Cloudera Hadoop training into mainland China. Since then, he has dedicated himself to driving widespread adoption of Hadoop-based big data technologies, helping hundreds of engineers, architects, IT managers, executives, university students, and their teachers learn these technologies, including Cloudera’s industry-leading, world-recognized best practices and solutions.

Presentations

Data science at scale: Using Spark and Hadoop 2-Day Training

Maojin Jiang demonstrates how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Through in-class simulations and exercises, Maojin walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.

TRAINING: Data science at scale: Using Spark and Hadoop (Day 2) Training Day 2

Maojin Jiang demonstrates how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Through in-class simulations and exercises, Maojin walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.

Yantisa Akhadi is a project manager for the Humanitarian OpenStreetMap team, whose mission is to promote the use of OpenStreetMap, QGIS, and InaSAFE in humanitarian response and economic development throughout Indonesia. Yantisa has a strong background in free and open source software and the FOSS community. In the past eight years, Yantisa has been involved in information management for various natural disasters in Indonesia.

Presentations

OpenStreetMap for urban resilience Tutorial

The use of maps in disaster response is evidently important. Yantisa Akhadi explores how to use OpenStreetMap (OSM), the biggest crowdsourced mapping platform, for safer urban environments, drawing on case studies from several major cities in Indonesia where citizen and government mapping has played a major role in improving resilience.

Tyler Akidau is a senior staff software engineer at Google Seattle. He leads technical infrastructure’s internal data processing teams in Seattle (MillWheel and Flume), is a founding member of the Apache Beam PMC, and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and “Streaming 101” and “Streaming 102.” His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Learn stream processing with Apache Beam Tutorial

Tyler Akidau, Slava Chernyak, and Dan Halperin offer a guided walkthrough of Apache Beam (incubating)—the most sophisticated and portable stream processing model on the planet—covering the basics of robust stream processing (windowing, watermarks, and triggers) with the option to execute exercises on top of the runner of your choice (Flink, Spark, or Google Cloud Dataflow).

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and John Akred explain how to create a modern data strategy that powers data-driven business.

Office Hour with John Akred (Silicon Valley Data Science) Office Hours

Stop by and talk with John Akred if you want to build a strong data strategy.

The business case for Spark, Kafka, and friends Session

Spark is white-hot, but why does it matter? Some technologies cause more excitement than others, and at first the only people who understand why are the developers who use them. John Akred offers a tour through the hottest emerging data technologies of 2016 and explains why they’re exciting, in the context of the new capabilities and economies they bring.

What's your data worth? Session

The unique properties of data make it difficult to assess its overall value using traditional valuation approaches. John Akred discusses a number of alternative approaches to valuing data within an organization for specific purposes so that you can optimize decisions around its acquisition and management.

Antonio Alvarez is the head of data innovation at Isban UK, which aims to spearhead the transformation to a data-driven organization through digital technology. In partnership with the CDO, Antonio is creating a collaborative environment where innovative strategies and propositions around data from all sides of the business can create value for customers more quickly. Adoption has moved at a quick pace, and Santander UK is now implementing different frameworks for scaling and broadening the impact of data to disrupt the bank from the inside through a guided self-service approach. Antonio has a background in economics and 18 years of experience in financial services across four countries in business, technology, change, and data.

Presentations

First mover or fast follower? Santander UK's big data journey Session

Santander was one of the last big banks in the UK to start using Hadoop and other big data technologies. However, the maturity of the technology made it possible to create a customer-facing data product in production in less than a year and a fully adopted production analytics platform in less than two. Antonio Alvarez shares what other late entrants can learn from this experience.

Franz Aman is senior vice president of brand and demand at Informatica, where he is responsible for branding, global demand generation, marketing operations, content, and digital marketing. Previously, Franz held numerous executive positions within industry-leading technology companies, including SAP, BusinessObjects, BEA Systems, SGI, and Sun Microsystems. He has more than 20 years of experience in leadership and innovation across marketing, including global product marketing, product management, strategy, brand, and communications. Franz holds a degree in geophysics from Ludwig-Maximilians University, Munich, Germany.

Presentations

How to use a marketing data lake for data-driven marketing Session

Marketing has become ever more data driven. While there are thousands of marketing applications available, it is challenging to get an end-to-end line of sight and fully understand customers. Franz Aman explains how bringing the data from the various applications and data sources together in a data lake changes everything.

Sarang Anajwala is the technical product manager for Autodesk’s next-generation data platform, where he focuses on building the self-service platform for data analysis and data products. Sarang has extensive experience in data architecture and data strategy. Previously, he worked as an architect building robust big data systems, including a next-generation platform for contextual communications and a data platform for the IoT. He has filed three patents for his innovations in the contextual communications and adaptive design spaces.

Presentations

From application to platform: The strategic shift in approach toward data analytics Tutorial

Sarang Anajwala discusses Autodesk’s next-generation data platform and its transition from an application for usage analytics to a platform for data analytics that provides capabilities like self-service ETL, data exploration, multitenant data apps, and data products. This versatile platform supports use cases ranging from dashboards to data science, helping drive the move into a data-centric future.

Amr Awadallah is cofounder and CTO of Cloudera. Previously, Amr was an entrepreneur in residence at Accel Partners and vice president of engineering at Yahoo, where he led a team that used Apache Hadoop extensively for data analysis and business intelligence across the Yahoo online services. Amr joined Yahoo through the acquisition of his first startup, VivaSmart. Amr holds a bachelor’s and a master’s degree in electrical engineering from Cairo University, Egypt, and a PhD in electrical engineering from Stanford University.

Presentations

The new dynamics of big data Keynote

Since its inception, big data solutions have best been known for their ability to master the complexity of the volume, variety, and velocity of data. But as we enter the era of data democratization, there’s a new set of concerns to consider. Amr Awadallah discusses the new dynamics of big data and explains how a renewed approach focused on where, who, and why can lead to cutting-edge solutions.

Nenshad Bardoliwalla is the founding vice president of products at Paxata, where he is responsible for product strategy, product management, and product marketing. Nenshad is an executive and thought leader with a proven track record of success leading product strategy, product management, and development in business analytics. Previously, he cofounded Tidemark Systems, Inc., where he drove the market, product, and technology efforts for its next-generation analytic applications built for the cloud through its series C funding; served as vice president for product management, product development, and technology at SAP, where he helped to craft the business analytics vision, strategy, and roadmap that led to the acquisitions of Pilot Software, OutlookSoft, and Business Objects, resulting in SAP’s market leadership in the overall business analytics market; and helped launch Hyperion System 9 while at Hyperion Solutions, which cemented Hyperion’s leadership in the corporate performance management space. Nenshad began his career at Siebel Systems working on Siebel Analytics, which became Siebel’s second-largest product line and the leader in customer analytic applications. The Siebel and Hyperion product lines now comprise Oracle’s flagship EPM and BI offerings. Nenshad is the coauthor of Driven to Perform: Risk-Aware Performance Management from Strategy through Execution from Evolved Technologist Press.

Presentations

A day in the life of a chief data officer (sponsored) Session

Join in to meet four experts who will share their views of the people, processes, and technologies that are driving information transformation around the world, including machine learning, big data, the cloud, and distributed computing. Find out why the role of chief data officer is at the center of driving tangible business value from data across the enterprise.

Joerg Blumtritt is the founder and CEO of Datarella, a computational social science startup delivering mobile analytics, self-tracking solutions, and data science consulting. After graduating from university with a thesis on machine learning, Joerg worked as a researcher in behavioral sciences, focused on nonverbal communication. His projects have been funded by an EU commission, the German federal government, and the Max Planck Society. He subsequently ran marketing and research teams for TV networks ProSiebenSat.1 and RTL II and magazine publisher Hubert Burda Media. As European operations officer at Tremor Media, Joerg was in charge of building the New York-based video advertising network’s European enterprises. More recently, he was managing director of MediaCom Germany. Joerg is the founder and chairman of the German Social Media Association (AG Social Media) and the coauthor of the Slow Media Manifesto. Joerg blogs about big data and the future of social research at Beautifuldata.net and about the Quantified Self at Datarella.com.

Presentations

Algorithmic art and data creativity Session

Data already plays an important role as raw material for art, from algorithmic visualization and parametric architecture to works created entirely by autonomous machines. With data-driven art, data science now touches even the most human aspects of culture. Heather Dewey-Hagborg and Joerg Blumtritt share examples and discuss possible routes for future data art.

Data ethics Session

Data ethics covers more than just privacy. In a connected world where most people rely on data-driven services, opting out and locking data away is hardly an option. More important than keeping data private is ensuring fairness and preventing abuse. Joerg Blumtritt and Heather Dewey-Hagborg show how to deal with data in an ethical way that has sound economic value.

Nicolette Bullivant is head of data engineering in the Data Innovation department of Isban UK, where she is responsible for managing several big data implementations. Nici is a technical manager with 15 years’ experience in the IT services industry and wide-ranging technical and functional skills; she has spent most of her career working with data. She previously led large-scale change projects across multiple locations, comprising data provision, managed MI and data warehouses, ETL, system integration, and IT alignment.

Presentations

Support digital applications with a resilient, highly available, and NRT Hadoop backend Session

Jorge Pablo Fernandez and Nicolette Bullivant explore Santander Bank's Spendlytics app, which helps customers track their spending by offering a listing of transactions, transaction aggregations, and real-time enrichment based on the categorization of transactions by market and brand. Along the way, they share the challenges encountered and lessons learned while implementing the app.

Raymond Chan is the principal data scientist at the Tao of Shop, where he champions data science and analytics techniques. As part of a startup applying Agile methodology, he works with a number of software engineers to fully integrate his algorithms into production systems with little hierarchical bureaucracy. His primary expertise is in building mathematical models for decision making during movement, which he developed as a postdoc at Baylor College of Medicine and a PhD student at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany. When not data sciencing at his day job, Raymond is a core lead with DataKind SG, donating his time to NGOs that need data science or analytics support, such as the Committee for UN Women, Phandeeyar—the Myanmar tech hub, and HOME (Humanitarian Organization for Migration Economics), based in Singapore. Raymond holds a PhD in informatics from the University of Leipzig.

Presentations

DataKind SG: Dispatches from the front line of data-driven social development Session

Raymond Chan dives into the trials and tribulations of DataKind SG, an organization that provides data science consulting for social good, operating in the digitally underserved but rapidly developing frontier of Southeast Asia.

Ranveer Chandra is a principal researcher at Microsoft Research, where he leads an incubation on IoT applications, with a focus on agriculture. Ranveer also leads research projects on white space networking, low-latency wireless, and improving battery life of mobile devices. He has published more than 60 research papers and filed over 100 patents, 65 of which have been granted. His technology has shipped as part of Windows 7, Windows 8, Windows 10, Xbox, Visual Studio, and Windows Phone. Ranveer has won several awards, including the MIT Technology Review’s Top Innovators Under 35, best paper awards at ACM CoNext 2008, ACM SIGCOMM 2009, IEEE RTSS, and USENIX ATC, and the Microsoft Graduate Research Fellowship, and was recognized as a Fellow in Communications of the World Technology Network. Ranveer holds an undergraduate degree from IIT Kharagpur, India, and a PhD in computer science from Cornell University.

Presentations

Dancing with intelligent dragon drones Session

Jennifer Marsman, Ranveer Chandra, and Wee Hyong Tok explore the various drone technologies that are currently available and explain how to acquire and analyze real-time signals from drones to design intelligent applications.

Raju Chellam is the head of big data and cloud for the Healthcare and Government practice in South Asia at Dell EMC. Raju is a member of the Singapore National Cloud Advisory Panel, the former honorary chair of the Cloud Outage Incidence Response Group of IDA, honorary secretary of the Cloud and Big Data chapter of the Singapore IT Federation, and honorary secretary of the Cloud chapter at the Singapore Computer Society.

Presentations

Bake your big data pie with HPC and AI (sponsored) Session

The future of big data is AI, and the future is here. With machine learning and deep learning, with robotics and heuristics, with fuzzy logic and AI, the convergence is changing myriad industries like healthcare, banking, insurance, and gaming. Raju Chellam explains why it’s time to step back and consider how big data with HPC and AI can make a key difference in your management.

Slava Chernyak is a senior software engineer at Google. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Google Cloud Dataflow Streaming from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.

Presentations

Learn stream processing with Apache Beam Tutorial

Tyler Akidau, Slava Chernyak, and Dan Halperin offer a guided walkthrough of Apache Beam (incubating)—the most sophisticated and portable stream processing model on the planet—covering the basics of robust stream processing (windowing, watermarks, and triggers) with the option to execute exercises on top of the runner of your choice (Flink, Spark, or Google Cloud Dataflow).

Watermarks and triggers: Time and progress in Apache Beam (incubating) and beyond Session

Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications.

Flavio Clesio is a specialist in machine learning and revenue assurance at Movile, where he helps build core intelligent applications to exploit revenue opportunities and automate decision making. Prior to Movile, Flavio was a business intelligence consultant in financial markets, specializing in nonperforming loans. He holds a master’s degree in computational intelligence applied to financial markets.

Presentations

Machine learning in practice with Spark MLlib: An intelligent data analyzer Session

Can you imagine intelligent software that assists your decision making and drives actions? Flavio Clesio and Eiti Kimura offer a practical demonstration of using machine learning to create an intelligent monitoring application based on distributed system data analysis using Apache Spark MLlib.

Stuart Coleman is a data scientist at Lloyds Banking Group working in fraud prevention. Stuart has seven years of experience working as a data scientist in both the finance and startup spaces. Previously, he was a data scientist at Growth Intelligence, a predictive marketing company; the founder of a startup delivering personalized learning for high school students; a quantitative analyst at UBS, where he modeled the equity derivatives market; and a research fellow at Imperial College working in turbulent flows. Stuart holds a PhD in fluid flow in complicated geometries from Imperial College.

Presentations

The 12 stations of becoming a data-centric organization Tutorial

Lyudmila Lugovskaya and Stuart Coleman discuss some of the many challenges that organizations face on their journey to become data-centric and share lessons learned from their experience doing and promoting data science within organizations of different types and sizes while dealing with restrictions imposed by traditional governance structures and policies.

Lisa Collins is a principal consultant for Adobe’s Marketing Cloud, where she provides strategic consulting advice in the areas of digital analytics, transformation, strategy, marketing, business measurement, and data-driven maturity. Over the last decade, Lisa has worked with a wide range of customers across APAC and EMEA and a diverse range of industries from finance through to retail, media, telecommunications, and travel.

Presentations

Data storytelling: Craft meaningful stories that drive action Session

To communicate high-value, data-driven insights, analysts need stories, but data storytelling is difficult. Lisa Collins demonstrates how to build a compelling story with the right blend of narrative, data, and visualization.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata + Hadoop World conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Data science and critical thinking Session

Join Lean Analytics author, Harvard lecturer, and Strata chair Alistair Croll for a look at how to think critically about data, based on his Harvard Business School course.

Thursday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Shannon Cutt is the development editor in the data practice area at O’Reilly Media.

Presentations

Office Hour with Shannon Cutt and Paco Nathan (O'Reilly Media) Office Hours

Have you always wanted to become an author? Shannon and Paco will talk with you about upcoming ideas and projects that O'Reilly is looking to publish.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Jason Dai is currently a Senior Principal Engineer and CTO of Big Data Technologies at Intel, responsible for leading the global engineering teams (located in both Silicon Valley and Shanghai) developing advanced big data analytics (including distributed machine learning and deep learning), as well as collaborations with leading research labs (e.g., UC Berkeley AMPLab).

He is an internationally recognized expert on big data, cloud computing, and distributed machine learning. Jason is the program cochair of Strata Data Conference Beijing, a committer and PMC member of the Apache Spark project, and the creator of BigDL (https://github.com/intel-analytics/BigDL/), a distributed deep learning framework on Apache Spark.

Presentations

Web-scale machine learning on Apache Spark Session

Jason Dai and Yiheng Wang share their experience building web-scale machine learning using Apache Spark—focusing specifically on "war stories" (e.g., in-game purchases, fraud detection, and deep learning)—outline best practices to scale these learning algorithms, and discuss trade-offs in designing learning systems for the Spark framework.

Devin Deen is the director of data and analytics at enterprise IT (e-IT), where he ensures that customers can easily navigate the complexities of implementing data warehouses, data management, business intelligence, and business analytics tools and solutions. With experience ranging from managing multimillion-dollar projects to leading teams and starting businesses across the healthcare, education, telecommunications, financial services, and utilities sectors, Devin attributes his success to being proactive and having an inclusive management style. His 25-year career has included roles in the United States Marine Corps and the Simpl Group. Most recently, Devin led the team at niche BI consulting firm Altis for over 10 years. Outside of e-IT, he is active in the NZ tech startup community and directly involved with successful SaaS company ProjectManager.com.

Presentations

Act on insight with the IoT Session

Embedding operational analytics with the IoT enables organizations to act on insights in real time. Devin Deen and Dnyanesh Prabhu walk you through examples from Sky TV and NZ Bus—two businesses that iteratively developed their analytic capabilities integrating the IoT on Hadoop, allowing people and process changes to keep pace with technical enablement.

Heather Dewey-Hagborg is an NYC-based transdisciplinary artist and educator interested in art as research and critical practice; she is currently an assistant professor of art and technology studies at the School of the Art Institute of Chicago. Heather has shown work internationally at events and venues including the World Economic Forum, Ars Electronica in Linz, the Shenzhen Urbanism and Architecture Biennale, the Poland Mediations Biennale, the Article Biennial in Norway, the Science Gallery Dublin, Transmediale in Berlin, Fotomuseum Winterthur, Centre de Cultura Contemporània de Barcelona, ZKM Museum of Contemporary Art in Germany, Museum Boijmans, Van Abbemuseum, and MU Art Space in the Netherlands and has exhibited nationally at MoMA PS1, the New Museum, Eyebeam, the New York Public Library, and the Utah Museum of Contemporary Art, among many others. In addition to her individual work, Heather has collaborated with the collective Future Archaeology, with video artist Adriana Varella, and with artists Thomas Dexter, Aurelia Moser, Allison Burtch, and Adam Harvey.

Heather’s work has been featured in print in the New Yorker, the New York Times, Paper magazine, Arts Asia Pacific, the Wall Street Journal, the Times of London, Newsweek, New Scientist, Popular Science, Il Sole 24 Ore, Science magazine, and C Magazine, as well as on the cover of Government Technology; on television on CNN, Dan Rather Reports, the BBC World Service, ZDF in Germany, and Fuji and Freed Television in Japan, Channel One, RTR and Lenta in Russia, Norwegian Broadcasting; on the radio on Public Radio’s Science Friday, Studio 360, and CBS News; and online in the New York Times Magazine, TED, the Guardian, the New Inquiry, Reuters, the New York Post, NPR, Wired, Smithsonian, Le Monde, Haaretz, The Creators Project, neural.it, Art Ukraine, designboom, Capital New York, Artlog, Rhizome, Fast Company, The Verge, Motherboard, the Boston Globe, Huffington Post, Gizmodo, and the Daily Beast, among many others. Heather has given workshops and talks at museums, schools, conferences, and festivals, including MoMA, TEDxVienna, SxSW, Eyeo, the Broad Institute of Harvard and MIT, the Media Lab, the Woodrow Wilson Policy Center, Bio-IT World, the Norwegian Biotechnology Advisory Board, and LISA. Heather has received grants, residencies, or awards from Creative Capital, Eyebeam, MOMA PS1, Ars Electronica, Vida Art and Artificial Life Competition, Clocktower Gallery, Jaaga, I-Park, Sculpture Space, the Foundation for Contemporary Arts, CEPA Gallery, the Nathan Cummings Foundation, and the National Science Foundation. Heather holds a BA in information arts from Bennington College, a master’s degree from the Interactive Telecommunications program at NYU’s Tisch School of the Arts, and a PhD in electronic arts from Rensselaer Polytechnic Institute.

Presentations

Algorithmic art and data creativity Session

Data already plays an important role as raw material for art, from algorithmic visualization and parametric architecture to works created entirely by autonomous machines. With data-driven art, data science now touches even the most human aspects of culture. Heather Dewey-Hagborg and Joerg Blumtritt share examples and discuss possible routes for future data art.

Data ethics Session

Data ethics covers more than just privacy. In a connected world where most people rely on data-driven services, opting out and locking data away is hardly an option. More important than keeping data private is ensuring fairness and preventing abuse. Joerg Blumtritt and Heather Dewey-Hagborg show how to deal with data in an ethical way that has sound economic value.

Masaru Dobashi is a system infrastructure engineer and leads the OSS professional service team at NTT DATA Corporation. Masaru developed an enterprise Hadoop cluster consisting of over 1,000 nodes in 2009, which was one of the largest Hadoop clusters in Japan at the time. After that, he designed and provisioned several kinds of clusters using non-Hadoop OSS, such as Spark and Storm. Masaru is now responsible for introducing Hadoop, Spark, Storm, and other OSS middleware into enterprise systems and developing data processing systems.

Presentations

IoT and Spark MLlib applications for improving products, services, and manufacturing technologies Session

IHI has developed a common platform for remote monitoring and maintenance and has started leveraging Spark MLlib to get up to speed developing applications for process improvement and product fault diagnosis. Yoshitaka Suzuki and Masaru Dobashi explain how IHI used PySpark and MLlib to improve its services and share best practices for application development and lessons for operating Spark on YARN.

Wolff Dobson is a developer programs engineer at Google specializing in machine learning and games. Before Google, he worked as a game developer, where his projects included writing AI for the NBA 2K series and helping design the Wii Motion Plus. Wolff holds a PhD in artificial intelligence from Northwestern University.

Presentations

Ask me anything: TensorFlow AMA

Wolff Dobson answers your questions about TensorFlow.

Deep learning with TensorFlow Tutorial

Wolff Dobson walks you through training and deploying a machine-learning system using TensorFlow, a popular open source library, and demonstrates how to build machine-learning systems from simple classifiers to complex image-based models.
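The "simple classifier" end of that spectrum can be sketched without TensorFlow at all. The plain-Python logistic-regression trainer below (the data and all names are illustrative, not from the tutorial) shows the gradient-descent loop that TensorFlow expresses as a differentiable computation graph:

```python
import math

def train_logistic(data, epochs=2000, lr=0.5):
    """Train a one-feature logistic-regression classifier by gradient descent.

    A plain-Python sketch of the kind of simple classifier the tutorial
    starts from; TensorFlow writes the same loss and update as a computation
    graph and differentiates it automatically.
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid prediction
            w -= lr * (p - y) * x                      # gradient of log loss wrt w
            b -= lr * (p - y)                          # gradient of log loss wrt b
    return w, b

# Toy separable data: points below 0 labeled 0, above 0 labeled 1.
data = [(-2, 0), (-1, 0), (1, 1), (2, 1)]
w, b = train_logistic(data)
predict = lambda x: 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5 else 0
```

The same loop, swapped for a convolutional network and run on real image data, is the "complex image-based models" end of the tutorial.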

TensorFlow at Google: Research, products, and art Keynote

Machine learning and artificial intelligence show great promise, but in reality, machine learning and deep learning are already here and in use all around you. Find out how Google uses large-scale machine learning in many of its products and how TensorFlow and ML can help your business (and even help you make art and music).

Mathieu Dumoulin is a data scientist in MapR Technologies’s Tokyo office, where he combines his passion for machine learning and big data with the Hadoop ecosystem. Mathieu started using Hadoop from the deep end, building a full unstructured data classification prototype for Fujitsu Canada’s Innovation Labs, a project that eventually earned him the 2013 Young Innovator award from the Natural Sciences and Engineering Research Council of Canada. Afterward, he moved to Tokyo with his family where he worked as a search engineer at a startup and a managing data scientist for a large Japanese HR company, before coming to MapR.

Presentations

A simplified enterprise architecture for real-time stream processing Session

Mathieu Dumoulin offers an overview of stream processing and explains how to simplify a seemingly complex real-time enterprise streaming architecture using an open source business rules engine and streaming via the Apache Kafka API. Mathieu then illustrates this architecture with a demo based on a successful production use case from the Smart City initiative in Busan, South Korea.

Architecting a hybrid cloud application using a global publish-subscribe streaming message system Session

Hybrid cloud architectures marry the flexibility to scale workloads on-demand in the public cloud with the ability to control mission-critical applications on-premises. Publish-subscribe message streams offer a natural paradigm for hybrid cloud use cases. Mathieu Dumoulin describes how to architect a real-time, global IoT analytics hybrid cloud application with a Kafka-based message stream system.
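As a rough illustration of why publish-subscribe fits these use cases, the toy in-process broker below (all names are illustrative; real systems such as Kafka add partitioned, replicated, durable logs and network transport) shows the key property: producers and consumers reference only topics, never each other, so either side can live on-premises or in the cloud:

```python
from collections import defaultdict

class MiniBroker:
    """A toy in-process publish-subscribe broker illustrating the paradigm."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber on the topic receives every message; the
        # publisher never knows who, or how many, they are.
        for callback in self._subscribers[topic]:
            callback(message)

broker = MiniBroker()
received = []
broker.subscribe("sensor-readings", received.append)
broker.publish("sensor-readings", {"device": "pump-7", "temp_c": 81.5})
```

Decoupling through the topic is what lets a hybrid deployment replicate the same stream between an on-premises cluster and a public cloud region.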

Ted Dunning has been involved with a number of startups—the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member for the Apache ZooKeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data mining and passionate about open source, he is an active participant in Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

A stream-first approach to drive real-time applications (sponsored) Session

Ted Dunning explains how a stream-first approach simplifies and speeds development of applications, resulting in real-time applications that have significant impact. Along the way, Ted contrasts a stream-first approach with existing approaches that start with an application that dictates specialized data structures, ETL activities, data silos, and processing delays.

Modern telecom analytics with streaming data Session

Modern telecommunications are alphabet soups that produce massive amounts of diagnostic data. Ted Dunning offers an overview of a real-time, low-fidelity simulation of the edge protocols of such a system to help illustrate how modern big data tools can be used for telecom analytics. Ted demos the system and shows how several tools can produce useful analytical results and system understanding.

Office Hour with Ted Dunning (MapR Technologies) Office Hours

Ted will talk about streaming architecture, microservices, building high-performance systems, open source, math, and machine learning.

Mateusz Dymczyk is a Japan-based software engineer at H2O.ai, where he works as a researcher on machine learning and NLP projects. Previously, he worked at Fujitsu Laboratories. Mateusz loves all things distributed and machine learning and hates buzzwords. In his spare time, he participates in the IT community by organizing, attending, and speaking at conferences and meetups. Mateusz holds an MSc in computer science from AGH UST in Krakow, Poland.

Presentations

Deep learning at scale Session

Deep learning has made a huge impact on predictive analytics and is here to stay, so you'd better get up to speed with the neural net craze. Mateusz Dymczyk explains why all the top companies are using deep learning, what it's all about, and how you can start experimenting and implementing deep learning solutions in your business in only a few easy steps.

Susan Etlinger is an industry analyst at Altimeter Group, where she works with global companies to develop both social and data intelligence strategies that support their business objectives. Susan has a diverse background in marketing and strategic planning within both corporations and agencies. She’s a frequent speaker on social media and analytics and has been extensively quoted in outlets including Fast Company, the BBC, the New York Times, and the Wall Street Journal. You can find her on Twitter at @setlinger and at her blog, Thought Experiments.

Presentations

Image intelligence: Making visual content predictive Keynote

Susan Etlinger lays out the market opportunities, challenges, and use cases for image intelligence and offers recommendations for organizations that wish to unlock the predictive potential of visual content.

Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. Sameer works with the Hadoop ecosystem, Cassandra, Couchbase, and the general NoSQL domain. Prior to Databricks, he worked globally as a freelance big data consultant and trainer, teaching big data courses. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).

Presentations

Spark camp: Exploring Wikipedia with Spark Tutorial

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Sameer Farooqui explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms.

Clara Fletcher is a senior manager and technical architect at Accenture. Clara comes from a broad background that includes econometric forecasting, complex event processing, infrastructure design, and enterprise data provisioning. She is actively involved with emerging big data technologies, industry interest groups, and volunteer education programs. She has won the National Service Trust award, served as a Hackbright mentor, and holds a patent in digital document verification technology. Clara also teaches the Accenture hands-on big data course and leads development of the online NoSQL course.

Presentations

Next-generation data governance Session

Implementing a data governance strategy that is agile enough to take on the new technical challenges of big data while being robust enough to meet corporate standards is a huge, emerging challenge. Clara Fletcher explores what next-generation data governance will look like and what the trends will be in this space.

Michael Freeman is a lecturer at the University of Washington Information School, where he teaches courses on data visualization and web development. With a background in public health, Michael works alongside research teams to design and build interactive data visualizations to explore and communicate complex relationships in large datasets. His freelance work ranges from web design to software consulting. You can take a look at samples from his projects here.

Presentations

Writing reusable visualization software with D3.js: Part I Session

Stop copying and pasting your D3.js visualization code each time you start a new project and start writing intelligent visualization software. Michael Freeman demonstrates how to build modular, reusable charting code by leveraging foundational JavaScript principles (such as closures) and the reusability structure used internally by the D3.js library.

Writing reusable visualization software with D3.js: Part II Session

Stop copying and pasting your D3.js visualization code each time you start a new project and start writing intelligent visualization software. Michael Freeman demonstrates how to build modular, reusable charting code by leveraging foundational JavaScript principles (such as closures) and the reusability structure used internally by the D3.js library.
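The reusable-component pattern both sessions build on is written in JavaScript for D3.js, but the core idea, a closure that owns its configuration and exposes chainable setters, works in any language with first-class functions. A minimal Python sketch of the pattern (the names and the text-only "rendering" are illustrative):

```python
def bar_chart():
    """Closure-based reusable component, modeled on the D3.js chart pattern.

    Configuration lives in the enclosing scope; each setter returns the
    chart function itself, so calls chain. Rendering here is just a text
    placeholder standing in for SVG output.
    """
    config = {"width": 400, "height": 200}

    def chart(data):
        return f"bar chart {config['width']}x{config['height']}: {data}"

    def width(value):
        config["width"] = value
        return chart

    def height(value):
        config["height"] = value
        return chart

    chart.width = width
    chart.height = height
    return chart

# Configure once via chained setters, then reuse on any dataset.
chart = bar_chart().width(800).height(300)
rendered = chart([1, 2, 3])
```

Because each call to `bar_chart()` creates a fresh closure, every chart instance keeps independent configuration, which is exactly what makes the D3.js version reusable across projects.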

Maosong Fu is the technical lead for Heron and real-time analytics at Twitter and the author of a few publications in the distributed systems area. Maosong holds a master’s degree from Carnegie Mellon University and a bachelor’s degree from Huazhong University of Science and Technology.

Presentations

Twitter's real-time stack: Processing billions of events with Heron and DistributedLog Session

Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Maosong Fu offers an overview of the end-to-end real-time stack Twitter designed in order to meet this challenge, consisting of DistributedLog (the distributed and replicated messaging system) and Heron (the streaming system for real-time computation).

Andrea Gagliardi La Gala is a data solution architect at Microsoft, where he helps organizations gain a competitive edge by leveraging cloud-based big data and machine-learning technologies. Andrea has 16 years’ experience in IT, delivering large-scale software solutions across a range of sectors and focusing on distributed computing frameworks and analytics.

Presentations

How Mediacorp has leveraged Apache Spark and Microsoft Cloud to analyze patterns of user behavior for actionable insights Session

Mediacorp analyzes its online audience through a computationally and economically efficient cloud-based platform. The cornerstone of the platform is Apache Spark, a framework whose clean APIs and performance gains make it an ideal choice for data scientists. Andrea Gagliardi La Gala and Brandon Lee highlight the platform’s architecture, benefits, and considerations for deploying it in production.

Adam Gibson is the CTO and cofounder of Skymind, a deep learning startup focused on enterprise solutions in banking and telco, and the coauthor of Deep Learning: A Practitioner’s Approach.

Presentations

Deep reinforcement learning on Spark Session

Adam Gibson offers a brief overview of deep reinforcement learning on Spark, exploring how to run large-scale training on Spark and the implications for deep reinforcement learning targeting the Doom environment.

David Gledhill is group chief information officer and head of group technology and operations at leading Asian bank DBS Bank, recently named world’s best digital bank, where he manages about 10,000 professionals across the region and is focused on strengthening the bank’s technology and infrastructure platform to drive greater resilience, organizational flexibility, and innovation. Executing against DBS’s strategy to be at the forefront of digital transformation, David also plays a lead role in driving the bank’s innovation agenda, which encompasses design thinking, Agile methodology, data analytics, fintech partnerships, hackathons, and more. David also has responsibility for the group’s operations, which is focused on reimagining customer journeys and the way business is supported so as to make banking simpler and more effortless for customers, and oversees procurement and real estate initiatives. David brings with him over 25 years of experience in the financial service industry, 20 of which were spent in Asia. Previously, David spent 20 years at JPMorgan, holding senior regional positions in technology and operations. David is a director of Singapore Clearing House Pte. Ltd., a member of IBM’s advisory board, and a member of the National Super Computing Centre steering committee. He is also a board advisor to Singapore Management University’s School of Information Systems. A British citizen, David holds a bachelor of science in computing and electronics from the University of Durham.

Presentations

Big data, big value for smart banking at DBS Keynote

Mike Olson explores the latest trends in how organizations are using big data to drive board-level business decisions. Mike will be joined by David Gledhill, who will share how DBS is using big data and customer 360 to improve customer experience and drive efficiency in ATM network and customer call center operations.

Gopal GopalKrishnan is a solution architect in the Partners & Strategic Alliances group at OSIsoft. Gopal has been working with OSIsoft’s PI System since the mid-1990s in software development, technical and sales support, and field services. Previously, he was a product manager with a focus on enterprise and asset integration and PI data access. Gopal is a registered professional engineer in Pennsylvania. He is a member of the MESA Technical Committee, the Education Committee, and the MESA Continuous Process Industry Special Interest Group. He actively participates in topics such as big data, data mining, energy efficiency, manufacturing intelligence, and sustainability (including green initiatives in facilities and data centers). Gopal holds a master’s degree in engineering and has continuing education in business administration.

Presentations

Industrial big data and sensor time series data: Different but not difficult—Part II Session

Picking up where his talk at Strata + Hadoop World in London left off, Gopal GopalKrishnan shares lessons learned from using components of the big data ecosystem for insights from industrial sensor and time series data and explores use cases in predictive maintenance, energy optimization, process efficiency, production cost reduction, and quality improvement.

Ferry Grijpink is a partner at McKinsey & Company and leader of McKinsey’s Telecommunications, Media, and Technology practice in Southeast Asia, where he focuses on advising clients on technology, strategy, marketing, and operations. Ferry has served leading telecom, technology, and media companies in Europe, Africa, and Asia on a wide range of issues related to digital transformation, operational improvement, and technology. Some of his recent work includes developing a digital roadmap for an Asian integrated incumbent, developing a smart city strategy for an ICT service provider, and shaping a mobile financial services strategy for a global operator. In addition to serving clients, Ferry also coleads McKinsey’s research in deploying and commercializing next-generation infrastructures, such as fiber, mobile broadband, and the Internet of Things.

Previously, Ferry worked for Gemini Consulting in the high-tech consulting unit, where he served consumer electronics and semiconductor companies. Ferry is an active entrepreneur in the mobile Internet space. He has written numerous articles on big data in telecoms, CTO agenda, 4G, mobile OTT, and frequency auctions and frequently speaks to industry groups and at events, such as the Mobile World Congress and CommunicAsia, among others, on technology trends and on the impact of technologies and innovation on society, operator strategies, and the IoT. Ferry holds an MSc in electrical engineering with a major in telecommunications from the Delft University of Technology.

Presentations

Where machines can replace humans and where they can't—yet Keynote

Automation technologies such as machine learning and robotics play an increasingly important role in everyday life, and their potential effect on the workplace has become a cause for concern. Ferry Grijpink explores which jobs will or won't be replaced by machines.

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry, and he has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on big data topics at national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing Tutorial

Mark Grover, Ted Malaska, and Jonathan Seidman explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world and discuss how to use components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover, Jonathan Seidman, and Ted Malaska, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Top five mistakes when writing Spark applications Session

Ted Malaska and Mark Grover cover the top five things that prevent Spark developers from getting the most out of their Spark clusters. When these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters and the same data, using just a different approach.

Alex Gutow is a senior product marketing manager at Cloudera, focused on the analytic database platform solution and technologies. Prior to Cloudera, she managed technical marketing and PR for Basho Technologies and managed consumer and enterprise marketing for Truaxis, a MasterCard company. Alex holds a BS in marketing and a BA in psychology from Carnegie Mellon University.

Presentations

BI and SQL analytics with Hadoop in the cloud Session

Alex Gutow and Henry Robinson explain how Apache Hadoop and Apache Impala (incubating) take advantage of the cloud to provide the same great functionality and partner ecosystem as on-premises deployments, combined with the flexibility and cost efficiency of the cloud.

Dan Halperin is a PPMC member and committer on Apache Beam (incubating). He has worked on Beam and Google Cloud Dataflow for 18 months. Previously, he was the director of research for scalable data analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in oceanography, astronomy, medical informatics, and the life sciences. Dan holds a PhD in computer science and engineering from the University of Washington.

Presentations

Apache Beam: A unified model for batch and streaming data processing Session

Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience building big data infrastructure within Google. Beam pipelines are portable across open source and private cloud runtimes. Dan Halperin covers the basics of Apache Beam—its evolution, the main concepts in the programming model, and how it compares to similar systems.

Learn stream processing with Apache Beam Tutorial

Tyler Akidau, Slava Chernyak, and Dan Halperin offer a guided walkthrough of Apache Beam (incubating)—the most sophisticated and portable stream processing model on the planet—covering the basics of robust stream processing (windowing, watermarks, and triggers) with the option to execute exercises on top of the runner of your choice (Flink, Spark, or Google Cloud Dataflow).
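To make the windowing vocabulary concrete before the tutorial, here is a toy event-time tumbling-window aggregation in plain Python (illustrative only; Beam expresses this declaratively, e.g. with fixed windows, and adds watermarks and triggers to decide when a window's result may be emitted despite late-arriving data):

```python
def tumbling_window_counts(events, window_size):
    """Group (event_time, value) pairs into fixed-size event-time windows.

    A toy model of the windowing idea covered in the tutorial: each event
    is assigned to the window containing its event time, regardless of
    the order in which events arrive.
    """
    windows = {}
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        windows.setdefault(window_start, []).append(value)
    return {start: len(vals) for start, vals in sorted(windows.items())}

# Five events with event times spread over three 10-second windows.
events = [(1, "a"), (3, "b"), (12, "c"), (14, "d"), (27, "e")]
counts = tumbling_window_counts(events, window_size=10)
```

What this sketch leaves out is precisely what the tutorial teaches: watermarks estimate how complete a window's input is, and triggers control when (and how often) its partial or final result is produced.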

Hao Hao is a software engineer at Cloudera currently working on the Apache Sentry project, a granular, role-based authorization module for the Hadoop cluster. She is also a PMC member of the Apache Sentry (TLP) project. Hao performed extensive research on smartphone security and web security as a PhD student at Syracuse University. Prior to joining Cloudera, Hao worked on eBay’s Search Backend team building search infrastructure for eBay’s online buying platform.

Presentations

Authorization in the cloud: Enforcing access control across compute engines Session

Hao Hao and Alex Leblang explore the architecture of Apache Sentry and RecordService (RS) for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access controls, walking you through using Apache Sentry and RS to protect sensitive data on a multitenant cloud across the Hadoop ecosystem.

Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic-net regularization in Spark’s ML library and one-pass elastic-net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine-learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.

Presentations

Spark Structured Streaming for machine learning Session

Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark's new Structured Streaming and walk you through creating your own streaming model.

Qirong Ho is vice president of technology at Petuum, Inc., an adjunct assistant professor at the Singapore Management University School of Information Systems, and a former principal investigator at A*STAR’s Institute for Infocomm Research. Qirong’s research focuses on distributed cluster software systems for machine learning at big data and big model scales, with a view toward theoretical correctness and performance guarantees, as well as practical needs like robustness, programmability, and usability. Qirong also works on statistical models for large-scale network analysis and social media, including latent space models for visualization, community detection, user personalization, and interest prediction. He is a recipient of the Singapore A*STAR National Science Search Undergraduate and PhD fellowships and the KDD 2015 Doctoral Dissertation Award (runner up).

Presentations

High-efficiency systems for distributed AI and machine learning at scale Session

When operating on billions of data events per day, modern AI and machine-learning programs require distributed clusters with tens to hundreds of machines. Qirong Ho offers an introduction to high-efficiency AI and ML distributed systems developed as part of the Petuum open source project and explains how they can reduce capital and operational costs for businesses.

Office Hour with Qirong Ho (Petuum, Inc.) Office Hours

Interested in optimizing speed and performance for machine learning and artificial intelligence applications or current research trends in machine learning and artificial intelligence? Meet with Qirong.

Juliet Hougland is a data scientist at Cloudera and contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil and gas pipelines at Deep Signal and designing and building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in applied mathematics from the University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in math-physics.

Presentations

Guerrilla guide to Python and Apache Hadoop Tutorial

Sean Owen and Juliet Hougland offer a practical overview of the basics of using Python data tools with a Hadoop cluster, covering HDFS connectivity and dealing with raw data files, running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating), and using Apache Spark to write some more-complex analytical jobs.

Andy Huang is a managing consultant in the big data analytics practice at Servian, a leading consulting company in Australia and New Zealand, where he works with clients in telco, banking, and financial services on big data analytics projects. Andy’s project portfolio includes use of Spark for data integration, streaming, and large-scale machine learning. He also leads solution architecture and implementation and evangelizes Apache Spark in the region.

Presentations

Spark foundations: Prototyping Spark use cases on Wikipedia datasets 2-Day Training

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Andy Huang employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.

TRAINING: Spark foundations: Prototyping Spark use cases on Wikipedia datasets (Day 2) Training Day 2

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Brian Clapper employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.

Mike Jiang is a vice president at is-land Systems Inc. Mike has over 15 years of experience developing data analysis software, especially for semiconductor engineering data. He has led technical teams providing system integration, professional services, and system development for customers in over 200 projects. His research interests include expert systems, machine learning, parallel and distributed systems, and data mining. He has published 17 international journal and conference papers and 16 workshop papers. Mike is also the author of 23 Chinese-language books on programming and software applications.

Presentations

How to apply big data solutions in the semiconductor industry Session

Rebecca Tien Yu Lin and Mon-Fong Mike Jiang offer an overview of a Hadoop-based big data solution that helps the semiconductor industry increase yield by monitoring the huge volume of tool logs and data generated by the FDC system.

Zhiwei Jiang is the global head of the Insights and Data (I&D) practice within Capgemini’s Financial Services global business unit (FSGBU), where he leads the FS I&D global team (2,500+ staff members) in creating an optimized end-to-end data platform, developing strategic capabilities, and building product vendor alliances to help ensure Capgemini is offering relevant and innovative solutions to the global financial services market. Zhiwei has deep domain expertise in big data, data integration, data quality, risk, trading, connectivity, and finance and extensive management and execution experience spanning multiple continents, including Europe, America, and Asia. Previously, Zhiwei was a managing director at Deutsche Bank, managing finance and risk IT in application services. He previously led Deutsche Bank’s connectivity domain globally and managed equities IT for the EMEA and APAC regions. Zhiwei started his career in financial services as a risk programmer with Morgan Stanley in New York. He later joined Goldman Sachs, where he ran Asia equities trading technology based out of Hong Kong and equity derivatives risk management systems in New York. Zhiwei holds an MS in computer science from Rensselaer Polytechnic Institute in New York and is a Chartered Financial Analyst. Currently, he is based in London but travels frequently throughout North America, Europe, and Asia.

Presentations

Making big data work for enterprise ecosystems: Democratizing expertise within software frameworks (sponsored) Session

In the past decade, enterprises have made massive investments in IT to keep pace with increasing data volumes and velocity. Similarly, the tool set to solve use cases for developers, data scientists, and analysts is ever-expanding. Grant Salisbury and Zhiwei Jiang explore how enterprises can create more value from IT investments and how big data can improve decisions across the enterprise.

Steve Jones is Capgemini’s group vice president for big data. Steve focuses on delivering large-scale big data solutions that answer point business challenges. He is the author of Enterprise SOA Adoption Strategies and the creator of the Business Data Lake reference architecture, the first unified approach to big and fast data analytics. Steve was also one of the very first to have integrated Google, Salesforce, and Amazon solutions into traditional enterprises.

Presentations

Stopping your data lake from becoming a swamp (sponsored) Session

Garbage in, garbage out—this truism has become significantly more impactful for big data as companies have moved away from traditional schema-based approaches to more flexible and dynamic file system approaches. Steve Jones explains how to add governance, schema evolution, and the industrialization required to deliver true enterprise-grade big data solutions.

Piotr Kaczmarek is an information designer with over 20 years of experience in qualitative and quantitative data visualization, presented in formats ranging from print and digital displays to interactive animations. He is the coauthor of Visualizing Financial Data, a book about visualization techniques and design principles that includes over 250 visuals depicting quantitative data.

Presentations

Encoding new data visualizations Session

Julie Rodriguez and Piotr Kaczmarek provide a fresh take on data visualizations with an extensive set of case studies that contrast traditional uses of charts with new methods that provide more effective representations of the data to produce greater insights.

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. At narrativeVIZ Consulting, Amit uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Amit also teaches storytelling with data as guest faculty in executive courses at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 12 years of management consulting experience with AT Kearney in India, Booz & Company in Europe, and more recently for startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi and a PGDM (MBA) from IIM, Ahmedabad. Find more about him at Amitkaps.com.

Presentations

Deep learning for natural language processing Session

Ever wondered how Google Translate works so well, how the autocaptioning works on YouTube, or how to mine the sentiments of tweets on Twitter? What’s the underlying theme? They all use deep learning. Bargava Subramanian and Amit Kapoor explore artificial neural networks and deep learning for natural language processing to get you started.

Machine learning: The power of ensembles Session

Creating better models is a critical component of building a good data science product. It is relatively easy to build a first-cut machine-learning model, but what does it take to build a reasonably good or state-of-the-art model? Ensemble models, which exploit the power of computing to search the solution space, are one answer. Bargava Subramanian discusses various strategies for building ensemble models.
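As a taste of the simplest ensembling strategy, majority voting, here is a minimal Python sketch (the base models are hypothetical stand-ins for independently trained classifiers):

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine independent base classifiers by majority vote.

    A minimal illustration of one ensembling strategy; others covered in
    practice include bagging, boosting, and stacking, which differ in how
    the base models are trained and combined.
    """
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical base models; two of the three agree on "spam".
models = [lambda x: "spam", lambda x: "spam", lambda x: "ham"]
label = majority_vote(models, "win a prize now")
```

The intuition: if the base models make largely independent errors, the ensemble's accuracy exceeds that of any single member.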

Holden Karau is a software development engineer at IBM and is active in open source. Prior to IBM, she worked on a variety of big data, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. Holden is the author of Learning Spark and has assisted with Spark workshops. She graduated from the University of Waterloo with a bachelor of mathematics in computer science.

Presentations

Office Hour with Holden Karau (IBM) Office Hours

Curious about Structured Streaming or Spark performance in general? Come and chat with Holden about Spark best practices, contributing to Spark, or really anything Spark related.

Spark Structured Streaming for machine learning Session

Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark's new Structured Streaming and walk you through creating your own streaming model.

Nitin Khandelwal works on the Hive stack at Qubole, where he has worked on key features in Hive, including support for encrypted communication, the multitenant Hive tier, and various performance issues. Previously, Nitin worked with Microsoft. Nitin holds a degree from IIIT-Hyderabad.

Presentations

Securing big data on YARN, Hive, and Spark clusters Session

YARN includes security features such as SSL encryption, Kerberos-based authentication, and HDFS encryption. Nitin Khandelwal and Abhishek Modi share the challenges they faced in enabling these features for ephemeral clusters running in the cloud with multitenancy support, as well as performance numbers for the available encryption algorithms.

Eiti Kimura is an IT coordinator and architect of distributed and high-performance platforms at Movile Brazil. Eiti has over 15 years of experience working with software development. He is an enthusiast of open technologies—he was an Apache Cassandra MVP from 2014 to 2015—and has vast experience with backend systems for carrier billing services, sending bulk text messages (SMS), and user action tracking. Eiti holds a master’s degree in electrical engineering with a specialization in software engineering.

Presentations

Machine learning in practice with Spark MLlib: An intelligent data analyzer Session

Can you imagine an intelligent software to assist in your decision making and drive actions? Flavio Clesio and Eiti Kimura offer a practical demonstration of using machine learning to create an intelligent monitoring application based on a distributed system data analysis using Apache Spark MLlib.

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Presentations

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker provide an introduction to using Impala + Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

John Kreisa leads international marketing for Hortonworks, where he is responsible for all marketing activities across Europe and Asia. John has been with Hortonworks for more than four years and has been on the inside through massive growth in the big data market, including Apache Hadoop and related technologies. With more than 20 years of marketing experience across storage, analytics, and big data, John has extensive experience helping organizations understand the benefits of technologies.

Presentations

Case studies of business transformation through big data Session

The opportunity to harness data to impact business is ripe, and as a result, every industry, every organization, and every department is going through a huge change, whether they realize it or not. John Kreisa shares use cases from across Asia and Europe of businesses that are successfully leveraging new platform technologies to transform their organizations using data.

Aljoscha Krettek is a PMC member at Apache Flink, where he mainly works on the Streaming API and designed and implemented the most recent additions to the windowing and state APIs. Aljoscha is a cofounder and software engineer at data Artisans. Previously, he worked at IBM Germany and at the IBM Almaden Research Center in San Jose. Aljoscha has spoken at Hadoop Summit, Flink Forward, and several meetups about stream processing and Apache Flink. He studied computer science at TU Berlin.

Presentations

Robust stream processing with Apache Flink Session

Aljoscha Krettek offers a very short introduction to stream processing before diving into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible. All of this will be done in the context of a real-time analytics application that we'll be modifying on the fly based on the topics we're working through.

Chi-Yi Kuan is director of business analytics at LinkedIn. He has over 15 years of extensive experience in applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Presentations

Understanding the voice of members via text mining: How LinkedIn built a text analytics engine at scale Session

Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang explain how LinkedIn has built a "voice of member" platform to analyze hundreds of millions of text documents. Chi-Yi, Weidong, and Yongzheng illustrate the critical components of this platform and showcase how LinkedIn leverages it to derive insights such as customer value propositions from an enormous amount of unstructured data.

Shameek Kundu is the global chief data officer (CDO) of Standard Chartered Bank. Based in Singapore, he is part of SCB’s global leadership team and its IT and operations management team. As CDO, Shameek is responsible for building the technology underpinning data and analytics (including Hadoop and big data, traditional data warehouses, and BI and analytics) and for the governance of data quality; the CDO team plays a key role in incubating and/or enabling analytics initiatives in different parts of the business. Since May 2016, Shameek has also held the additional role of global head of architecture and innovation, in which he is responsible for the bank’s eXellerator lab in Singapore, a platform for speeding up the development and testing of ideas through design thinking and rapid prototyping. Prior to joining SCB in 2009, Shameek was an associate partner in McKinsey & Company’s London office. Shameek holds an engineering degree from NIT Warangal and an MBA from IIM Calcutta.

Presentations

A day in the life of a chief data officer (sponsored) Session

Join in to meet four experts who will share their views of the people, processes, and technologies that are driving information transformation around the world, including machine learning, big data, the cloud, and distributed computing. Find out why the role of chief data officer is at the center of driving tangible business value from data across the enterprise.

Scott Kurth is the vice president of advisory services at Silicon Valley Data Science, where he helps clients define and execute the strategies and data architectures that enable differentiated business growth. Building on 20 years of experience making emerging technologies relevant to enterprises, he has advised clients on the impact of technological change, typically working with CIOs, CTOs, and heads of business. Scott has helped clients drive global technology strategy, prioritize technology investments, shape alliance strategy based on technology, and build solutions for their businesses. Previously, Scott was director of the Data Insights R&D practice within Accenture Technology Labs, where he led a team focused on employing emerging technologies to discover the insight contained in data and bring that insight to bear on business processes, enabling new and better outcomes and even entirely new business models. He also led the creation of Technology Vision, Accenture’s annual analysis of emerging technology trends impacting the future of IT, where he was responsible for tracking emerging technologies, analyzing their transformational potential, and using that analysis to influence technology strategy for both Accenture and its clients.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and John Akred explain how to create a modern data strategy that powers data-driven business.

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. Jared is the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, as well as an adjunct professor of statistics at Columbia University, in addition to his client-facing consulting and training. Jared specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world. He was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master’s degree in statistics from Columbia University and a bachelor’s degree in mathematics from Muhlenberg College.

Presentations

Analyzing NFL play-by-play data Session

Jared Lander worked with the Minnesota Vikings to bring moneyball to football for the 2015 NFL draft. Join Jared as he dives further into football, using statistical modeling and R to analyze opponent play-calling, examine when the New York Giants will run or pass the ball, and discern quarterback Eli Manning's favorite receivers.

Alex Leblang is an engineer at Cloudera on the RecordService team. Previously, Alex was an Apache Impala (incubating) engineer and interned at Vertica. He holds a bachelor’s degree from Brown University with concentrations in computer science and Latin American studies.

Presentations

Authorization in the cloud: Enforcing access control across compute engines Session

Hao Hao and Alex Leblang explore the architecture of Apache Sentry and RecordService (RS) for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access controls, walking you through using Apache Sentry and RS to protect sensitive data on a multitenant cloud across the Hadoop ecosystem.

Brandon Lee is assistant vice president and senior data scientist at Mediacorp, where his research focuses on processing methods for user profiling and end-to-end productization of Mediacorp’s big data analytics platform. Brandon has spent more than 20 years in the data science and research fields, working with Fortune 500 companies and startups. Previously, at Samsung R&D, he introduced an award-winning big data framework, based on Hadoop and MapReduce, to implement distributed machine-learning algorithms for the electronics semiconductor business.

Presentations

How Mediacorp has leveraged Apache Spark and Microsoft Cloud to analyze patterns of user behavior for actionable insights Session

Mediacorp analyzes its online audience through a computationally and economically efficient cloud-based platform. The cornerstone of the platform is Apache Spark, a framework whose clean APIs and performance gains make it an ideal choice for data scientists. Andrea Gagliardi La Gala and Brandon Lee highlight the platform’s architecture, benefits, and considerations for deploying it in production.

Holds an MSc in mathematics (theoretical computer science). Has two years of game technical operations experience and has worked on a Hadoop and Spark cluster since mid-2016.

Presentations

How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing Session

Real-time data analysis is becoming more and more important to Internet companies’ daily business. Qunar has been running Alluxio in production for over a year. Lei Xu explores how stream processing on Alluxio has led to a 16x performance improvement on average and 300x improvement at service peak time on workloads at Qunar.

Xueyan Li is a data platform R&D engineer at Qunar, where he is mainly responsible for the continuous integrated development of resource management system Mesos and distributed memory management system Alluxio, as well as data for all business lines based on public service support. Other focuses include the ELK log ETL platform, Spark, Storm, Flink, and Zeppelin. He holds a degree in software engineering from Heilongjiang University.

Presentations

How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing Session

Real-time data analysis is becoming more and more important to Internet companies’ daily business. Qunar has been running Alluxio in production for over a year. Lei Xu explores how stream processing on Alluxio has led to a 16x performance improvement on average and 300x improvement at service peak time on workloads at Qunar.

Presentations

Evolution of big data analytics (sponsored) Session

Organizations used to store information in separate silos. As a result, searching for the data you needed was a difficult affair. KC Wong explores big data analytics (BDA) platforms that can produce what you need in a much shorter timeframe and are even intelligent enough to present exactly what you need for greater efficiency, productivity, and profits.

Rebecca Tien Yu Lin is a director at is-land Systems Inc., where she leads the application team in helping customers establish big data systems and introduce HareDB solutions to their sites. Rebecca draws on her strong knowledge of the Hadoop ecosystem and its applications to provide professional services for her customers. Rebecca has more than eight years’ experience in project execution related to the semiconductor industry and has demonstrated innovative professional skill with a proven ability to identify, analyze, and solve problems to increase customer satisfaction.

Presentations

How to apply big data solutions in the semiconductor industry Session

Rebecca Tien Yu Lin and Mon-Fong Mike Jiang offer an overview of a Hadoop-based big data solution helping the semiconductor industry increase yield by monitoring the huge amount of tool logs and the data generated from the FDC system.

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker provide an introduction to using Impala + Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Audrey Lobo-Pulo is a cofounder of Phoensight and an advocate for open government and the use of open source software in government modeling. A physicist who worked in high-speed data transmission, Audrey began working in economic policy modeling after joining the Australian Public Service and has since been involved in modeling a wide range of economic policy options in personal taxation, housing, pensions, superannuation, labor force, and population demographics. In 2015, Audrey presented internationally on government open source models (GOSMs), and she is currently involved in bringing data science to public policy analytics.

Presentations

Government open data: Tales from a deep dive into CKAN Tutorial

In early 2016, a team set out to score the usability of government open data across five countries. What was to be a small-scale project giving a data-driven picture of the supply side of open data grew into a lengthy, all-consuming quest to decipher the depths of government CKAN repositories. Audrey Lobo-Pulo shares the team's findings and explores the future possibilities of open data.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday opening welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Nir Lotan is a machine-learning product manager and team manager in Intel’s Advanced Analytics department. Nir’s team develops machine-learning and deep learning-related tools, including a tool that enables easy creation of deep learning models. Prior to this role, Nir held several product, system, and software management positions within Intel’s Design Center organization and other leading companies. Nir has 15 years of experience in software and systems engineering, products, and management. He holds a BSc degree in computer engineering from the Technion Institute of Technology.

Presentations

Fast deep learning at your fingertips Session

Nir Lotan describes a new, free software tool based on existing deep learning frameworks that enables the fast and easy creation of deep learning models and incorporates extensive optimizations that provide high performance on standard CPUs.

Lyudmila Lugovskaya is a data scientist at Lloyds Bank. Lyudmila has an interdisciplinary background in business, finance, and psychology and works to apply decision and data science to real-world problems and help businesses make better decisions by unlocking the potential of data. Prior to Lloyds Bank, Lyudmila spent a few years in the financial industry, specializing in corporate credit risk with a particular focus on emerging markets, and taught finance to university students. Lyudmila holds a PhD from the University of Cambridge, where her doctoral research was devoted to predicting the default of small and medium-size enterprises on the basis of financial and nonfinancial variables, as well as an MSc in psychology.

Presentations

The 12 stations of becoming a data-centric organization Tutorial

Lyudmila Lugovskaya and Stuart Coleman discuss some of the many challenges that organizations face on their journey to become data-centric and share lessons learned from their experience doing and promoting data science within organizations of different types and sizes while dealing with restrictions imposed by traditional governance structures and policies.

Donald MacDonald leads OCBC Group’s big data analytics, data-driven marketing, and overall CRM architecture across Asia, where he has responsibility for driving business value from data by providing actionable insights to segment, channel, and product managers across all geographies and business divisions. Donald manages a large marketing analytics center of excellence supporting all OCBC Group companies including banking, insurance, share trading, private banking, fintech, and external third parties. Capabilities within the CoE include data science, campaign decisioning, self-service analytics, operational CRM, and HR analytics. Donald also manages OCBC’s Operational CRM platform and implemented the bank’s award-winning next-generation unified sales desktop (ROME). Donald has over 20 years of international experience in analytics, helping companies to generate actionable insights and increase revenues. Prior to joining OCBC, he worked at IBM and PwC Consulting, delivering data-driven solutions globally. He began his career at Standard Life Group in the UK building and leading its Data Mining team.

Presentations

How OCBC sold its big data vision Keynote

OCBC has long been recognized as a leader in traditional analytics within Southeast Asia, investing over $100M to build its capability. Donald MacDonald shares how OCBC determined that what had worked in the past may not succeed in the future and how it built support for its next-generation platform.

Mark Madsen is a research analyst at Third Nature, where he advises companies on data strategy and technology planning. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide. He focuses on two types of work: the business applications of data and guiding the construction of data infrastructure. As a result, Mark does as much information strategy and IT architecture work as he does performance management and analytics.

Presentations

Dealing with device data Session

In 2007, a computer game company decided to jump ahead of competitors by capturing and using data created during online gaming, but it wasn't prepared to deal with the data management and process challenges stemming from distributed devices creating data. Mark Madsen shares a case study that explores the oversights, failures, and lessons the company learned along its journey.

Organizing the data lake Session

Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture.

Ted Malaska is a senior solution architect at Blizzard. Previously, he was a principal solutions architect at Cloudera. Ted has 18 years of professional experience working for startups, the US government, some of the world’s largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the largest nonprofit financial regulator in the US and has worked on close to one hundred clusters for over two dozen clients with hundreds of use cases. He has architecture experience across topics including Hadoop, Web 2.0, mobile, SOA (ESB, BPM), and big data. Ted is a regular contributor to the Hadoop, HBase, and Spark projects, a regular committer to Flume, Avro, Pig, and YARN, and the coauthor of Hadoop Application Architectures.

Presentations

Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing Tutorial

Mark Grover, Ted Malaska, and Jonathan Seidman explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world and discuss how to use components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover, Jonathan Seidman, and Ted Malaska, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Storage designs done right equal faster processing and access Session

If your design only focuses on the processing layer to get speed and power, you may be leaving a significant amount of optimization untapped. Ted Malaska describes a set of storage design patterns and schemas implemented on HBase, Kudu, Kafka, SolR, HDFS, and S3 that, by carefully tailoring how data is stored, can reduce processing and access times by two to three orders of magnitude.

Top five mistakes when writing Spark applications Session

Ted Malaska and Mark Grover cover the top five things that prevent Spark developers from getting the most out of their Spark clusters. When these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters and the same data, using just a different approach.

Verdi March is the chief research scientist with Deep Labs, specializing in parallel and distributed computing. Verdi has been applying his expertise to develop high-performance, scalable systems and applications across various domains and has extensive experience in the industrial R&D life-cycle at various Fortune 500 companies. Prior to joining Deep Labs, Verdi was a lead research scientist with Visa Labs, where he drove innovations on data science and next-generation big data platforms for payment analytics and risk managements. Previously, Verdi was with HP Labs Singapore, where he focused on cloud computing, and Sun Microsystems, where he focused on HPC/supercomputing. Verdi holds a PhD in computer science from the National University of Singapore and a bachelor of science in computer science from the University of Indonesia.

Presentations

Experience in adopting deep learning into existing software development practices Session

Verdi March demystifies deep learning and shares his experience on how to gradually transition to deep learning. Using a specific example in computer vision, Verdi touches upon key differences in engineering traditional software versus deep learning-based software.

Jennifer Marsman is a principal developer evangelist in Microsoft’s Developer and Platform Evangelism group, where she educates developers on Microsoft’s new technologies. In this role, Jennifer is a frequent speaker at software development conferences across the United States. Previously, Jennifer was a software developer in Microsoft’s Natural Interactive Services division, where she authored two patents for work on search and data mining algorithms. Jennifer has also held positions with Ford Motor Company, National Instruments, and Soar Technology.

In 2009, Jennifer was chosen as Techie Whose Innovation Will Have the Biggest Impact by X-ology for her work with GiveCamps, a weekend-long event where developers code for charity. She has also received many honors from Microsoft, including the Central Region Top Contributor award, Heartland District Top Contributor award, DPE Community Evangelist award, CPE Champion award, MSUS Diversity & Inclusion Award, and Gold Club. Jennifer holds a bachelor’s degree in computer engineering and a master’s degree in computer science and engineering from the University of Michigan in Ann Arbor, where her graduate work specialized in artificial intelligence and computational theory. Jennifer writes on her blog and tweets at @JenniferMarsman.

Presentations

Bots as the next UX: Expanding your apps with conversation Session

Matt Winkler and Jennifer Marsman explain how to easily extend your apps and services with bots to reach users where they are—in messaging apps—covering use cases and case studies, how to quickly get started building a bot, how to process input using linguistic analysis, and how to deploy and integrate bots with messaging apps.

Dancing with intelligent dragon drones Session

Jennifer Marsman, Ranveer Chandra, and Wee Hyong Tok explore the various drone technologies that are currently available and explain how to acquire and analyze real-time signals from drones to design intelligent applications.

Alyona Medelyan has been working on algorithms that make sense of language data for over a decade. Her passion lies in helping businesses extract useful knowledge from text. As part of her PhD, she proved that her open source algorithm, Maui, can be as accurate as people at finding keywords. She has worked with large multinationals like Cisco and Google, has led R&D teams, and has consulted for small and large companies around the globe. Alyona now runs Thematic, a customer insight company.

Presentations

Applications of natural language understanding: Tools and technologies Session

With the rise of deep learning, natural language understanding techniques are becoming more effective and are not as reliant on costly annotated data. This leads to an explosion of possibilities of what businesses can do with language. Alyona Medelyan explains what the newest NLU tools can achieve today and presents their common use cases.

Office Hour with Alyona Medelyan (Entopix) Office Hours

Looking for advice on text analysis and practical applications of natural language understanding? Alyona can help.

Abhishek Modi works on the Hadoop and YARN stack at Qubole, where he has worked on key YARN features such as its autoscaling framework and the balancing of spot nodes in a cluster. Previously, he worked at Adobe Systems, where he filed multiple patents. Abhishek holds a degree from IIT-Varanasi.

Presentations

Securing big data on YARN, Hive, and Spark clusters Session

YARN includes security features such as SSL encryption, Kerberos-based authentication, and HDFS encryption. Nitin Khandelwal and Abhishek Modi share the challenges they faced in enabling these features for ephemeral clusters running in the cloud with multitenancy support, as well as performance numbers for the available encryption algorithms.

Raghunath Nambiar is the CTO for Cisco UCS, where he helps define strategies for next-generation architectures, systems, and data center solutions and leads a team of engineers and product leaders focused on emerging technologies and solutions. Raghu’s current focus areas include emerging technologies, data center solutions, and big data and analytics strategy. He is Cisco’s representative to standards bodies for system performance, has served on several industry standards committees for performance evaluation and on the program committees of leading academic conferences, and chaired the industry’s first standards committee for benchmarking big data systems. Raghu has years of technical accomplishments, with significant expertise in system architecture, performance engineering, and creating disruptive technology solutions. He is a member of the IEEE big data steering committee, serves on the board of directors of the Transaction Processing Performance Council (TPC), and is founding chair of its International Conference Series on Performance Evaluation and Benchmarking. He has published 50+ peer-reviewed papers and book chapters. Raghu holds master’s degrees from the University of Massachusetts and Goa University and completed an advanced management program at Stanford University.

Presentations

Accelerating time to value at petascale with Cisco UCS (sponsored) Session

Raghunath Nambiar explores the architectural components of building large-scale big data systems with Cisco UCS, as well as use cases, lessons learned, best practice guidelines, and Cisco Validated Designs.

Prakash Nanduri is the CEO and cofounder of Paxata. Prakash is a seasoned entrepreneur and enterprise software visionary with over 20 years’ experience in both startups and large companies. Previously, he was the cofounder and vice president of Velosel Corporation (acquired by TIBCO in 2005), a pioneer of the master data management (MDM) space, where he drove the strategy and growth of the company, securing over $40 million in venture capital financing, recruiting a senior management team, and serving on the board of directors. Prakash led the postmerger integration effort at TIBCO before spending three years at SAP as the head of product and technology strategy within the Office of the CEO, where he was responsible for key strategic initiatives, including the SAP big data (Hana) business strategy.

Presentations

Information at the speed of thought (sponsored) Keynote

Today, we are able to collect, manage, and query data with very advanced data processing frameworks like Apache Hadoop in both on-premises and cloud deployments, yet turning data into trustworthy information is one of the toughest challenges facing businesses. Prakash Nanduri explains how to deal with the challenge of data swamps of #deplorable data.

Vijay Narayanan heads the algorithms and data science efforts in the Data group at Microsoft, where he works on building and leveraging machine-learning platforms, tools, and solutions to solve analytic problems in diverse domains. Previously, Vijay was a principal scientist at Yahoo Labs, where he worked on building cloud-based machine-learning applications in computational advertising; an analytic science manager at FICO, where he worked on launching a product to combat identity theft and application fraud using machine learning; a modeling researcher at ACI Worldwide; and a Sloan Digital Sky Survey research fellow in astrophysics at Princeton University, where he codiscovered the ionization boundary and the four farthest quasars in the universe. Vijay has authored or coauthored approximately 55 peer-reviewed papers in astrophysics, 10 papers on machine-learning and data mining techniques and applications, and 15 patents (filed or granted). He is deeply interested in the theoretical, applied, and business aspects of large-scale data mining and machine learning and has indiscriminate interests in statistics, information retrieval, extraction, signal processing, information theory, and large-scale computing. Vijay holds a bachelor of technology degree from IIT, Chennai, and a PhD in astronomy from the Ohio State University.

Presentations

The ACID revolution (sponsored) Keynote

Vijay Narayanan explains how rapid advances in algorithms, the cloud, the Internet of Things, and data are driving unimaginable breakthroughs in every human endeavor, across agriculture, healthcare, education, travel, smart nations, and more.

Paco Nathan leads the Learning group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was an evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

Computable content: Notebooks, containers, and data-centric organizational learning Session

O'Reilly recently launched Oriole, a new learning medium for online tutorials that combines Jupyter notebooks, video timelines, and Docker containers run on a Mesos cluster, based on the pedagogical theory of computable content. Paco Nathan explores the system architecture, shares project experiences, and considers the impact of notebooks for sharing and learning across a data-centric organization.

Office Hour with Shannon Cutt and Paco Nathan (O'Reilly Media) Office Hours

Have you always wanted to become an author? Shannon and Paco will talk with you about upcoming ideas and projects that O'Reilly is looking to publish.

Chris Neumann is a Venture Partner at 500 Startups, focused on big data, machine learning, and AI. He was previously the founder and CEO of DataHero (acquired by Cloudability), which brought to market the first self-service cloud BI platform, and the first employee at Aster Data (acquired by Teradata), where he helped create the big data space.

Presentations

The fallacy of the subject-matter expert Session

For decades, business intelligence companies have strived to make their products easier to use in the hope that they could finally reach the mythical subject-matter expert—that wondrous individual who would change the course of the company if only she had access to the data she needed. Drawing on his real-world experience, Chris Neumann asks, "What if the subject-matter expert doesn’t exist?"

Wei Keong is the director for business solutions at Fusionex. With more than a decade’s experience transforming enterprises into modern workspaces, Wei Keong has brought together big data analytics, business intelligence, and the Internet of Things into day-to-day productivity suites for various industries across the region. He believes in humanizing internal, external, and open data to produce actionable, business-transforming decisions, creating an ecosystem of blue lake strategies that may one day even become a blue ocean.

Presentations

A smarter ecosystem through big data analytics (sponsored) Keynote

Elevating the intelligence of the whole ecosystem of things is imperative to ensure the proliferation of IoT devices doesn't result in disparate, detached machinery. Isaac Jacob explains how Fusionex intends to achieve this by leveraging robust, intuitive BDA solutions with real-time capabilities.

Minh Chau Nguyen is a researcher in the Big Data Software Platform Research department at the Electronic and Telecommunications Research Institute (ETRI), one of the largest government-funded research institutes in Korea. His research interests include big data management, software architecture, and distributed systems.

Presentations

Unified metadata management for scalability, integrity, and reliability across geographically distributed data centers Session

Minh Chau Nguyen and Hee Sun Won demonstrate how all system, service, and user metadata can be managed in one unified platform across many geographically distributed data centers by extending the overall architecture of the Hadoop ecosystem so that multiple tenants and authorized third parties can securely access and modify the metadata at runtime via a so-called metadatabase.

Takayuki Nishikawa is a data scientist at Panasonic Corporation working on developing a big data analytics platform that analyzes logs from IoT home appliances in order to design new products and functions, create new services, maintain appliances, and more. In addition, Takayuki shares his knowledge about data analysis using machine learning within the company. His academic background is in artificial intelligence, and he holds a master’s degree in computer science from Osaka University in Japan. In his free time, Takayuki enjoys DIY electronics as a Maker.

Presentations

Integrated data analytics for consumer electronics using Hadoop and Spark MLlib Session

Takayuki Nishikawa and Ei Yamaguchi explain how Panasonic developed an integrated data analytics platform to analyze the increasing number of home appliance logs from its IoT products, achieving scalability for millions of households and a 10x improvement in processing time with Hadoop and Hive, in the process gaining more reliable knowledge about users’ lifestyles with Spark MLlib.

Patrick Nord is the director of big data, analytics, and insights for Archetype SC, a bespoke IT consultancy based out of Myrtle Beach, South Carolina, where he leads a team of consultants providing data-driven solutions to clients ranging from SMB to Fortune 500 companies, including numerous banks, airlines, manufacturers, marketing agencies, and insurance companies. Patrick is an accomplished analyst and consultant, having provided services to government entities and Fortune 500 companies across industry verticals. His particular expertise is operationalizing data—turning disparate data sources into cogent and actionable insights. The insights and intelligence he has provided are directly responsible for tens of millions of dollars in increased revenue and even more in cost savings. Patrick has presented at numerous IBM events, including most recently as the keynote speaker for IBM’s Datapalooza Denver.

Presentations

Data distillation: Applying design principles to reporting, KPI, and dashboards Session

Big data makes big promises, but when business users don't pay attention, your work and insights are wasted. Patrick Nord explains how to use a methodology developed by data scientists and designers to help you efficiently and effectively communicate your transformative insights—even to unreceptive executive teams.

Michael O’Connell is the chief analytics officer at TIBCO Software, where he develops analytic solutions across a number of industries, including financial services, energy, life sciences, consumer goods and retail, and telco, media, and networks. Michael has been working on statistical software applications for the past 20 years and has published more than 50 papers and several software packages on statistical methods. Michael holds a PhD in statistics from North Carolina State University, where he is an adjunct professor of statistics.

Presentations

Augmenting intelligence in an interconnected world Session

The interconnected world presents unprecedented opportunities to gain new insights on behavior, both human and nonhuman alike. At the same time, it poses unprecedented challenges for organizations seeking to act on these moments of opportunity in time. Michael O'Connell and San Zaw share real-world case studies demonstrating how real-time analytics solves these challenges.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Big data, big value for smart banking at DBS Keynote

Mike Olson explores the latest trends on how organizations are using big data to drive board-level business decisions. Mike will be joined by Dave Gledhill, who will share how DBS is using big data and customer 360 to improve customer experience and drive ATM network and customer call center operations efficiency.

Jingwen Ouyang is a staff big data developer at SanDisk, a Western Digital Brand. Coming from circuit design, Jingwen is uniquely positioned to bridge semiconductor manufacturing processes with big data platforms. Jingwen holds a BS and MEng from the Massachusetts Institute of Technology.

Presentations

Big data solutions for analyzing chip DNA in semiconductor manufacturing Tutorial

In semiconductor manufacturing, creating a high-yield process where sufficient portions of chips pass acceptance testing is extremely difficult to achieve. Data is collected and analyzed at every stage to improve yield and productivity. Amit Rustagi and Jingwen Ouyang share a Hadoop-based solution that reveals the true value and benefits of manufacturing data generated about every chip.

Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is the coauthor of Advanced Analytics with Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.

Presentations

Guerrilla guide to Python and Apache Hadoop Tutorial

Sean Owen and Juliet Hougland offer a practical overview of the basics of using Python data tools with a Hadoop cluster, covering HDFS connectivity and dealing with raw data files, running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating), and using Apache Spark to write some more-complex analytical jobs.

Jorge Pablo is the head of data Hadoop applications on the Data Innovation team at Isban UK (Santander), where he is responsible for building the development team for Hadoop and bringing new technologies and methodologies to Santander UK.

Presentations

Support digital applications with a resilient, highly available, and NRT Hadoop backend Session

Jorge Pablo Fernandez and Nicolette Bullivant explore Santander Bank's Spendlytics app, which helps customers track their spending by offering a listing of transactions, transaction aggregations, and real-time enrichment based on the categorization of transactions by market and brand. Along the way, they share the challenges encountered and lessons learned while implementing the app.

Presentations

Fast cars, big data: The Internet of Formula 1 Things Tutorial

Modern cars produce data. Lots of data. And Formula 1 cars produce more than their fair share. Ted Dunning presents a demo of how data streaming can be applied to the analytics problems posed by modern motorsports. Although he won't be bringing Formula 1 cars to the talk, Ted demonstrates a physics-based simulator to analyze realistic data from simulated cars.

Vivian Peng is a visual artist using design, animations, and data for storytelling. Vivian is currently a communications officer at Doctors Without Borders, using design to raise awareness of public health issues. She holds a master’s degree in public health informatics from Columbia University.

Presentations

The feels: How to design data visualizations that evoke an emotion from your users Session

When data is transformed into visualizations, the impact can sometimes be lost on the user. Drawing on her work with Doctors Without Borders, Vivian Peng explains how emotions help convey impact and move people to take action and demonstrates how we might design emotions into the data visualization experience.

Dnyanesh Prabhu is the enterprise data architect at SKY TV NZ. Dnyanesh started various data initiatives at SKY, including the big data initiative that will help transform SKY into a data-driven organization. Dnyanesh has a passion for data management, and the clients and organizations he has worked for have benefited from his data strategies, data roadmaps, BI health checks, BIBPs (business intelligence blueprints), master data management, metadata management, data quality, data migration, and data governance solutions. He has more than 15 years of experience consulting on, managing, designing, and architecting data solutions for multinational corporations like Warner Brothers, 20th Century Fox, Aviva, and Westpac. Dnyanesh is a TOGAF-certified enterprise architect and a Teradata-certified master.

Presentations

Act on insight with the IoT Session

Embedding operational analytics with the IoT enables organizations to act on insights in real time. Devin Deen and Dnyanesh Prabhu walk you through examples from Sky TV and NZ Bus—two businesses that iteratively developed their analytic capabilities integrating the IoT on Hadoop, allowing people and process changes to keep pace with technical enablement.

Phillip Radley is chief data architect on BT’s core Enterprise Architecture team, where he is responsible for data architecture across BT Group Plc. Based at BT’s Adastral Park campus in the UK, Phill currently leads BT’s MDM and big data initiatives, driving associated strategic architecture and investment roadmaps for the business. Phill has worked in IT and the communications industry for 30 years, mostly with British Telecommunications Plc., and his previous roles in BT include nine years as chief architect for infrastructure performance-management solutions from UK consumer broadband to outsourced Fortune 500 networks and high-performance trading networks. He has broad global experience, including with BT’s Concert global venture in the US and five years as an Asia Pacific BSS/OSS architect based in Sydney. Phill is a physics graduate with an MBA.

Presentations

Hadoop as a service at BT: How to build a successful enterprise data hub Session

If your organization has Hadoop clusters in research or as point solutions and you're wondering where you go from there, this session is for you. Phillip Radley explains how to run Hadoop as a service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.

Office Hour with Phillip Radley (BT) Office Hours

Do you want to persuade finance to fund a Hadoop cluster? Educate designers and architects to use Hadoop in their solutions? Get a data team to run Hadoop as a shared service? Democratize your data? Stop by and find out how Phillip did it; he’s got some great ideas for you.

Presentations

A day in the life of a chief data officer (sponsored) Session

Join in to meet four experts who will share their views of the people, processes, and technologies that are driving information transformation around the world, including machine learning, big data, the cloud, and distributed computing. Find out why the role of chief data officer is at the center of driving tangible business value from data across the enterprise.

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.

Presentations

BI and SQL analytics with Hadoop in the cloud Session

Alex Gutow and Henry Robinson explain how Apache Hadoop and Apache Impala (incubating) take advantage of the cloud to provide the same great functionality and partner ecosystem as on-premises deployments combined with the flexibility and cost efficiency of the cloud.

Julie Rodriguez is associate creative director at Sapient Global Markets. Julie is an experience designer focusing on user research, analysis, and design for complex systems. Julie has patented her work in data visualizations for MATLAB, compiled a data visualization pattern library, and publishes industry articles on user experience and data analysis and visualization. She is the coauthor of Visualizing Financial Data, a book about visualization techniques and design principles that includes over 250 visuals depicting quantitative data.

Presentations

Encoding new data visualizations Session

Julie Rodriguez and Piotr Kaczmarek provide a fresh take on data visualizations with an extensive set of case studies that contrast traditional uses of charts with new methods that provide more effective representations of the data to produce greater insights.

What if data had personality? Keynote

Julie Rodriguez discusses the idea of visualizing data in a way that renders its individual characteristics while encapsulating and projecting the personality of the data.

Ofer Ron is a principal data scientist and architect at LivePerson, where he works on conceptualizing data products, researching them, and getting them deployed and running in production.

Presentations

Concepts before machinery: Harnessing the power of domain expertise for machine-learning-based solutions Session

Ofer Ron examines the development of LivePerson's traffic targeting solution from a generic to a domain-specific implementation to demonstrate that a thorough understanding of the problem domain is essential to a good machine-learning-based product. Ofer then reviews the underlying architecture that makes this possible.

Amit Rustagi is an architect at SanDisk, where he is leading the architecture and strategy for big data solutions. Previously, Amit was an architect at Intuit, where he led the design and strategy of its Financial Aggregation Platform; a principal architect at eBay, where he led the architecture of analytics and experimentation infrastructure; and a senior principal architect for analytics products at Yahoo. He also held a lead role working on Oracle applications at Oracle Corp. Amit has a BS in electronics and communications.

Presentations

Big data solutions for analyzing chip DNA in semiconductor manufacturing Tutorial

In semiconductor manufacturing, creating a high-yield process where sufficient portions of chips pass acceptance testing is extremely difficult to achieve. Data is collected and analyzed at every stage to improve yield and productivity. Amit Rustagi and Jingwen Ouyang share a Hadoop-based solution that reveals the true value and benefits of manufacturing data generated about every chip.

Frank Säuberlich is Teradata’s director of data science, a role that combines demand generation across EMEA and APAC with analytical innovation. Previously, Frank worked at Urban Science International as a regional manager responsible for customer analytics, working with client teams around the globe to implement analytical solutions and pioneer new types of analysis to improve the efficiency of automotive clients’ marketing efforts, and as the European customer solutions practice manager, responsible for the Urban Science customer solutions practice in Europe. Before joining Urban Science, Frank was a senior technical consultant in data mining at SAS in Heidelberg, Germany, where he focused on e-intelligence and CRM in addition to data mining topics. Frank is a member of the German Classification Society. He holds a PhD in economics and a master’s degree in economic mathematics from the University of Karlsruhe.

Presentations

Making sense of the sensors: Connecting the IoT and analytics Session

Combining the power of the IoT and big data analytics opens the doors to a wide range of opportunities for organizations to solve new challenges that create an impact on the world that we live in. Frank Saeuberlich and Karthik Thirumalai explain why data management, data integration, and multigenre analytics are foundational to driving business value from IoT initiatives.

Grant Salisbury is GoodData’s head of sales engineering and business architecture, where he and his team work with global companies to create collaborative data products and smart business applications that drive innovation across their business ecosystems. Formerly an executive at a multinational NGO leading postconflict reconstruction programs, Grant draws on his leadership experience to help GoodData grow its business. He has extensive experience launching, scaling, and running data-driven global projects and has held senior roles in which he was responsible for projects with $10+ million budgets and a more than 1,000-person staff located across 10 countries, including three launches into new markets.

Presentations

Making big data work for enterprise ecosystems: Democratizing expertise within software frameworks (sponsored) Session

In the past decade, enterprises have made massive investments in IT to keep pace with increasing data volumes and velocity. Similarly, the tool set to solve use cases for developers, data scientists, and analysts is ever-expanding. Grant Salisbury and Zhiwei Jiang explore how enterprises can create more value from IT investments and how big data can improve decisions across the enterprise.

Rajesh Sampathkumar is a senior consultant at the Data Team, a strategy consulting organization focused on big data, data analytics, and data science, where he works with clients in diverse industries to provide data science expertise relevant to their business and decision making. Rajesh has many years of experience in consulting, design, and engineering at a number of reputable organizations.

Presentations

A survey of time series analysis techniques for sensor data Session

One challenge when dealing with manufacturing sensor data analysis is to formulate an efficient model of the underlying physical system. Rajesh Sampathkumar shares his experience working with sensor data at scale to model a real-world manufacturing subsystem with simple techniques, such as moving average analysis, and advanced ones, like VAR, applied to the problem of predictive maintenance.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies, Inc. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for six years.

Presentations

Evolving beyond the data lake Session

Everyone is talking about data lakes. The intended use of a data lake is as a central storage facility for performing analytics. But, Jim Scott asks, why have a separate data lake when your entire (or most of your) infrastructure can run directly on top of your storage, minimizing or eliminating the need for data movement, separate processes and clusters, and ETL?

Evolving from RDBMS to NoSQL + SQL Session

Application developers have long created complex schemas to store data with even minor relationships in an RDBMS. Jim Scott shows how to convert an existing music database with a complicated schema to HBase for transactional workloads and how to use Drill against HBase for real-time queries, and also discusses HBase column families.

Boon Siew Seah is the head of data engineering and innovation at SmartHub (part of StarHub Ltd.), where he is responsible for driving data innovation through telco data and big data technologies to deliver unique telco analytics products. Prior to joining StarHub, Boon Siew was the data science guild lead in the Data Science division of Infocomm Development Authority (IDA) in Singapore.

Presentations

From telco data to real-world data analytics products at SmartHub Session

Translating streaming, real-time telecommunications data into actionable analytics products remains challenging. Boon Siew Seah explores SmartHub’s past successes and failures building telco analytics products for its customers and shares the big data technologies behind its two API-based telco analytics products: Grid360 (geolocation analytics) and C360 (consumer insights).

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing Tutorial

Mark Grover, Ted Malaska, and Jonathan Seidman explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world and discuss how to use components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures AMA

Mark Grover, Jonathan Seidman, and Ted Malaska, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Chandra Sekhar Saripaka is a product developer, big data professional, and data scientist at DataSpark, Singtel. He has deep experience in financial products, CMS, and identity management and is an expert in data crunching at terabyte scale on graphs and Hadoop. Previously, Chandra carried out research on image search indexing and retrieval and has built many architectures for enterprise integration and portals, a cloud search engine for ecommerce, and a framework for real-time news recommendation systems.

Presentations

From telco data to spatial-temporal intelligence APIs: Architecting through microservices Session

Creating big data solutions that can process data at terabyte scale and produce spatial-temporal real-time insights at speed demands a well-thought-through system architecture. Chandra Sekhar Saripaka details the production architecture at DataSpark that works through terabytes of spatial-temporal telco data each day in PaaS mode and showcases how DataSpark operates in SaaS mode.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among others, as well as KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Presentations

Building and tuning machine-learning apps using Spark ML and GraphX Libraries Tutorial

Vartika Singh and Jayant Shekhar offer a hands-on tutorial that exposes you to techniques for building and tuning machine-learning apps using Spark ML libraries, building pipelines, tuning parameters, and graph processing with GraphX.

Vinay Shukla is the director of product management for Spark, Zeppelin, and Agile analytics at Hortonworks. Previously, Vinay worked as a developer and security architect. Vinay has been a frequent speaker at many conferences, including Hadoop Summit, Apache Big Data, JavaOne, and Oracle World. Vinay enjoys being on a yoga mat or on a hiking trail. You can follow him on his blog.

Presentations

Apache Spark: Enterprise security for production deployments Session

With enterprise adoption of Apache Spark come enterprise security requirements and the need to meet enterprise security standards. Vinay Shukla walks you through enterprise security requirements, provides a deep dive into Spark security features, and shows how Spark meets these enterprise security requirements.

Jiri Simsa is a software engineer at Alluxio and one of the maintainers and top contributors of the Alluxio open source project. Previously, he was a software engineer at Google, where he worked on the distributed framework for the IoT. Jiri holds a PhD in computer science from Carnegie Mellon University, where his work focused on systematic and scalable testing of concurrent systems.

Presentations

Alluxio (formerly Tachyon): An open source memory-speed virtual distributed storage system Session

Alluxio is an open source memory-speed virtual distributed storage system. In the past year, the Alluxio open source community has grown to more than 300 developers. The project also experienced a tremendous improvement in performance and scalability and was extended with new features. Haoyuan Li offers an overview of Alluxio, covering its use cases, its community, and the value it brings.

Office Hour with Jiri Simsa (Alluxio) Office Hours

Interested in Alluxio or storage? Stop by and meet Jiri.

Vartika Singh is a solutions architect at Cloudera with over 10 years of experience applying machine-learning techniques to big data problems.

Presentations

Building and tuning machine-learning apps using Spark ML and GraphX Libraries Tutorial

Vartika Singh and Jayant Shekhar offer a hands-on tutorial that exposes you to techniques for building and tuning machine-learning apps using Spark ML libraries, building pipelines, tuning parameters, and graph processing with GraphX.

Analyst covering Data and Analytics at Gartner

Presentations

A day in the life of a chief data officer (sponsored) Session

Join in to meet four experts who will share their views of the people, processes, and technologies that are driving information transformation around the world, including machine learning, big data, the cloud, and distributed computing. Find out why the role of chief data officer is at the center of driving tangible business value from data across the enterprise.

Chris is a system developer with 18+ years of experience, focused primarily on embedded electronic systems and networked system solutions. He has experience taking design solutions from concept through prototyping and on to mass manufacturing. His key areas of interest are networked embedded computer systems and backend enterprise solutions for building application technologies. Chris has spent considerable time designing, developing (from scratch), and integrating RFID-based access control systems, SCADA, and BMS solutions for the security, building management, and custom electronics industries. He has managed and owned a business as the technical officer of a security systems manufacturing concern. Chris holds a bachelor of engineering in electronics and control engineering, a master of science in embedded systems, and a diploma in IT infrastructure management. Presently, he is the IT manager handling development operations for the SinBerBEST program at the Berkeley Education Alliance for Research in Singapore (BEARS) Limited.

Presentations

Industrial big data and sensor time series data: Different but not difficult—Part II Session

Picking up where his talk at Strata + Hadoop World in London left off, Gopal GopalKrishnan shares lessons learned from using components of the big data ecosystem for insights from industrial sensor and time series data and explores use cases in predictive maintenance, energy optimization, process efficiency, production cost reduction, and quality improvement.

As chief data architect at Uber, M. C. Srivas worries about all data issues from trips, riders and partners, and pricing to analytics, self-driving cars, security, and data-center planning. Previously, M. C. was CTO and founder of MapR Technologies, a top Hadoop distribution; worked on search at Google, developing and running the core search engine that powered many of Google’s special verticals like ads, maps, and shopping; was chief architect at Spinnaker Networks (now Netapp), which formed the basis of Netapp’s flagship NAS products; and ran the Andrew File System team at Transarc, which was acquired by IBM. M. C. holds an MS from the University of Delaware and a BTech from IIT-Delhi.

Presentations

Real-time intelligence gives Uber the edge Keynote

M. C. Srivas covers the technologies underpinning the big data architecture at Uber and explores some of the real-time problems Uber needs to solve to make ride sharing as smooth and ubiquitous as running water, explaining how they are related to real-time big data analytics.

Bargava Subramanian is an India-based data scientist at Cisco Systems. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

Deep learning for natural language processing Session

Ever wondered how Google Translate works so well, how the autocaptioning works on YouTube, or how to mine the sentiments of tweets on Twitter? What’s the underlying theme? They all use deep learning. Bargava Subramanian and Amit Kapoor explore artificial neural networks and deep learning for natural language processing to get you started.

Machine learning: The power of ensembles Session

Creating better models is a critical component of building a good data science product. It is relatively easy to build a first-cut machine-learning model, but what does it take to build a reasonably good or state-of-the-art model? The answer is ensemble models, which help exploit the power of computing to search the solution space. Bargava Subramanian discusses various strategies for building ensemble models.
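The core idea behind ensembling, combining many independent, imperfect models so their errors cancel, can be illustrated with a minimal sketch. The `noisy_model` function below is a hypothetical stand-in for any trained predictor; it is not from the talk.

```python
import random

random.seed(42)

# True target value each model tries to predict.
TRUTH = 10.0

def noisy_model():
    """A hypothetical weak model: unbiased but high-variance prediction."""
    return TRUTH + random.gauss(0, 2.0)

def squared_error(pred):
    return (pred - TRUTH) ** 2

# Compare one model's error to the error of an averaged ensemble.
n_models, n_trials = 25, 1000
individual_err, ensemble_err = 0.0, 0.0
for _ in range(n_trials):
    preds = [noisy_model() for _ in range(n_models)]
    individual_err += sum(squared_error(p) for p in preds) / n_models
    ensemble_err += squared_error(sum(preds) / n_models)

individual_err /= n_trials
ensemble_err /= n_trials
# Averaging independent, unbiased predictors divides the variance by
# roughly the number of models, so the ensemble error is far smaller.
print(f"avg individual MSE: {individual_err:.2f}")
print(f"ensemble MSE:       {ensemble_err:.2f}")
```

In practice the component models are deliberately diversified (different algorithms, features, or data samples) so that their errors are as uncorrelated as possible.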

Yoshitaka Suzuki is a researcher in information science and technology at IHI Corporation. Yoshitaka has developed anomaly detection algorithms for several kinds of products, such as industrial machines and engines, but is now responsible for utilizing sensor data, developing software for anomaly detection and fault diagnosis, and verifying the practical effectiveness of distributed processing systems. Prior to IHI, he spent four years developing anomaly detection algorithms for machinery systems and social infrastructures at Kozo Keikaku Engineering Inc. Yoshitaka holds an MEng in aeronautics and astronautics from the University of Tokyo.

Presentations

IoT and Spark MLlib applications for improving products, services, and manufacturing technologies Session

IHI has developed a common platform for remote monitoring and maintenance and has started leveraging Spark MLlib to get up to speed developing applications for process improvement and product fault diagnosis. Yoshitaka Suzuki and Masaru Dobashi explain how IHI used PySpark and MLlib to improve its services and share best practices for application development and lessons for operating Spark on YARN.

Karthik Bharadwaj is a senior data scientist in the Data Science Center of Expertise at Teradata, where he provides analytic thought leadership and generates demand for Teradata products. Karthik has seven years of experience working in the data management and analytics industry. Previously, he worked as a researcher at IBM Research, developing smarter transportation systems that predict traffic on the Singapore road network. Karthik holds a master’s degree from the National University of Singapore.

Presentations

Making sense of the sensors: Connecting the IoT and analytics Session

Combining the power of the IoT and big data analytics opens the doors to a wide range of opportunities for organizations to solve new challenges that create an impact on the world that we live in. Frank Saeuberlich and Karthik Thirumalai explain why data management, data integration, and multigenre analytics are foundational to driving business value from IoT initiatives.

Wee Hyong Tok is a principal data science manager at Microsoft, where he works with teams to cocreate new value and turn each of the challenges facing organizations into compelling data stories that can be concretely realized using proven enterprise architecture. Wee Hyong has worn many hats in his career, including developer, program/product manager, data scientist, researcher, and strategist, and his range of experience has given him unique super powers to nurture and grow high-performing innovation teams that enable organizations to embark on their data-driven digital transformations using artificial intelligence. He has a passion for leading artificial intelligence-driven innovations and working with teams to envision how these innovations can create new competitive advantage and value for their business and strongly believes in story-driven innovation.

Presentations

Dancing with intelligent dragon drones Session

Jennifer Marsman, Ranveer Chandra, and Wee Hyong Tok explore the various drone technologies that are currently available and explain how to acquire and analyze real-time signals from drones to design intelligent applications.

Anusua Trivedi is a data scientist on Microsoft’s Advanced Data Science and Strategic Initiatives team, where she works on developing advanced predictive analytics and deep learning models. Previously, Anusua was a data scientist at the Texas Advanced Computing Center (TACC), a supercomputer center, where she developed algorithms and methods for the supercomputer to explore, analyze, and visualize clinical and biological big data. Anusua is a frequent speaker at machine learning and big data conferences across the United States, including Supercomputing 2015 (SC15), PyData Seattle 2015, and MLconf Atlanta 2015. Anusua has also held positions with UT Austin and University of Utah.

Presentations

Transfer learning and fine-tuning deep neural network models across different domains Session

Anusua Trivedi proposes a method to apply a pretrained deep convolution neural network (DCNN) on images to improve prediction accuracy. This approach improves prediction accuracy on domain-specific image datasets compared to state-of-the-art machine-learning approaches.
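The transfer-learning recipe the abstract describes, reusing a pretrained DCNN as a frozen feature extractor and retraining only a small classifier on the new domain, can be sketched without any deep learning framework. Here `pretrained_features` is a hypothetical stand-in for the frozen network (a real DCNN would emit high-dimensional feature vectors), and the "fine-tuned" part is a plain logistic regression trained by gradient descent.

```python
import math
import random

random.seed(0)

# Stand-in for a frozen, pretrained feature extractor (hypothetical).
# Class 0 features cluster near (-1, -1); class 1 near (+1, +1).
def pretrained_features(label):
    cx = 1.0 if label == 1 else -1.0
    return [cx + random.gauss(0, 0.5), cx + random.gauss(0, 0.5)]

# Small labeled dataset from the new target domain.
data = [(pretrained_features(y), y) for y in [0, 1] * 100]

# Retrain only the final layer: logistic regression via gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(200):
    for x, y in data:
        p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
        g = p - y                      # gradient of the log loss
        w[0] -= lr * g * x[0]
        w[1] -= lr * g * x[1]
        b -= lr * g

accuracy = sum(
    (1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b))) > 0.5) == (y == 1)
    for x, y in data
) / len(data)
print(f"accuracy on the new domain: {accuracy:.2f}")
```

Because the pretrained features already separate the classes well, only a few hundred labeled examples are needed, which is the practical appeal of transfer learning on small domain-specific image datasets.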

Combining an extensive background in product research, data analysis, program management, and software development, Cameron Turner cofounded the Data Guild in 2013. Previously, he founded ClickStream Technologies, which was acquired by Microsoft. While at Microsoft, Cameron managed the Windows Telemetry team, responsible for all inbound data for all Microsoft products and partners. He is an active member of and speaker at a number of Bay Area tech groups, including Churchill Club, SOFTECH, the Young CEOs Club, the CIO Roundtable, and BayCHI. Cameron holds a BA in architecture from Dartmouth College, an MBA from Oxford University, and an MS in statistics from Stanford University.

Presentations

Finding profit in your organization's data exhaust Session

Huge amounts of data are generated every minute by nearly every company, yet most of it goes unused. Historically, this so-called data exhaust has been collected only for manual analysis in the case of a fault or failure. Cameron Turner explains why companies are increasingly looking to their data exhaust as a valuable asset that can influence revenue and profit through machine learning.

Arun Veettil is a principal data scientist working on personalization and loyalty analytics at Starbucks. For the last five years, Arun has been working at the intersection of data science and product development, helping companies develop intelligent data products. He is currently focused on developing algorithms for personalization and marketing campaign optimization for Starbucks. Previously, Arun worked at Point Inside, Nordstrom Advanced Analytics, the Walt Disney Company, and IBM. His expertise includes developing machine-learning algorithms that run against very large amounts of data and building large-scale distributed applications. Arun holds a master’s degree in computer science from the University of Washington and a bachelor’s degree in electronics engineering from the National Institute of Technology, Allahabad, India.

Presentations

Context-aware recommendations using reinforcement learning in the item-similarity space Session

Making recommendations for the food and beverage industry is tricky: recommendations must take the user's context (location, time, day, etc.) into consideration in addition to the constraints of a regular recommendation algorithm. Arun Veettil explains how to incorporate user contextual information into recommendation algorithms and apply reinforcement learning to track continuously changing user behavior.
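One common way reinforcement learning tracks drifting user behavior is an epsilon-greedy bandit whose reward estimates use an exponential moving average, so old preferences are gradually forgotten. The sketch below is a toy illustration under that assumption, not the talk's algorithm: the `reward` function simulates a user whose favorite item changes halfway through.

```python
import random

random.seed(1)

# Simulated user feedback: item 0 is the favorite at first,
# then the user's preference shifts to item 2 at t=500.
def reward(item, t):
    best = 0 if t < 500 else 2
    return 1.0 if item == best else 0.2

n_items = 3
estimates = [0.0] * n_items
epsilon, alpha = 0.1, 0.1   # exploration rate, EMA step size

for t in range(1000):
    if random.random() < epsilon:
        item = random.randrange(n_items)                        # explore
    else:
        item = max(range(n_items), key=lambda i: estimates[i])  # exploit
    r = reward(item, t)
    # The exponential moving average discounts stale feedback, letting
    # the agent notice and follow the preference shift at t=500.
    estimates[item] += alpha * (r - estimates[item])

print("final preferred item:",
      max(range(n_items), key=lambda i: estimates[i]))
```

In a real recommender, each "item" would instead be a region of the item-similarity space, and the context (location, time, day) would select which bandit's estimates to use.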

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a streaming data platform built on the Lightbend Reactive Platform, Kafka, Spark, Flink, and Mesosphere DC/OS. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects and a co-organizer of several conferences around the world as well as several user groups in Chicago.

Presentations

Just enough Scala for Spark Tutorial

Apache Spark is written in Scala. Hence, many—if not most—data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.

Office Hour with Dean Wampler (Lightbend) Office Hours

Discuss Spark, Scala, and streaming data architectures with Dean.

Scala and the JVM as a big data platform: Lessons from Apache Spark Session

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Yiheng Wang is a software development engineer on the Big Data Technology team at Intel working in the area of big data analytics. Yiheng and his colleagues are developing and optimizing distributed machine learning algorithms (e.g., neural network and logistic regression) on Apache Spark. He also helps Intel customers build and optimize their big data analytics applications.

Presentations

Web-scale machine learning on Apache Spark Session

Jason Dai and Yiheng Wang share their experience building web-scale machine learning using Apache Spark—focusing specifically on "war stories" (e.g., in-game purchase, fraud detection, and deep learning)—outline best practices to scale these learning algorithms, and discuss trade-offs in designing learning systems for the Spark framework.

Sara M. Watson is a technology critic and a research fellow at the Tow Center for Digital Journalism as well as an affiliate with the Berkman Center for Internet and Society at Harvard University. Sara’s work explores how we are learning to live with, understand, and interpret our personal data and the algorithms that shape our experiences. Her writing has appeared in the Atlantic, Al Jazeera America, Wired, Gizmodo, the Harvard Business Review, and Slate. Sara previously worked as an enterprise technology analyst at the Research Board (Gartner, Inc.), exploring implications of technological trends for Fortune 500 CIOs. She also consults with organizations such as Crimson Hexagon, Brightcove, and the World Economic Forum on data practices and policies. She cohosts the Mindful Cyborgs podcast, reviews books on Goodreads, ran Tech Book Club and Angry Tech Salon at the Berkman Center, and writes a newsletter about her adventures in Southeast Asia. Sara holds an MSc in the social science of the Internet with distinction from the Oxford Internet Institute, where her award-winning thesis used ethnographic methods to examine the personal data interests of the Quantified Self community. She graduated from Harvard College magna cum laude with a joint degree in English and American literature and film studies. Sara enjoys emoji, karaoke, emoji karaoke, and making lists. She’s married to the brilliant and intrepid Nick R. Smith. She tweets as @smwat.

Presentations

Office Hour with Sara M. Watson (Tow Center for Digital Journalism) Office Hours

Most personalization today is rudimentary and coarsely targeted at best—and when it gets us wrong, it’s a negative experience. Sara will chat about ways to improve personalized experiences for users.

Taking personalization personally Keynote

Most consumer-facing personalization today is rudimentary and coarsely targeted at best, and designers don’t give users cues for how they are meant to interact with and interpret personalized experiences and interfaces. Sara Watson makes the case for personalization signals that give context to personalization and expose levers of control to users.

Graham Williams is Director of Data Science at Microsoft with responsibility covering the Asia/Pacific region. Graham joined Microsoft after over 30 years in Australia as a data scientist leading research and deployments in artificial intelligence, machine learning, data mining, analytics, and data science. Graham has authored a number of books introducing data mining and machine learning using the R statistical software. He was previously principal data scientist with the Australian Taxation Office and lead data scientist with the Australian Government’s Centre of Excellence in Analytics, where he assisted numerous government departments and Australian industry in creating and building data science capabilities. He has worked on many projects focused on delivering solutions and applications driven by data using machine learning and artificial intelligence technologies. He is an adjunct professor with the University of Canberra and the Australian National University and an international visiting professor with the Chinese Academy of Sciences.

Presentations

Cloud AI innovations (sponsored) Session

Powerful artificial intelligence is emerging as the underlying technology for future intelligent applications. Building new applications requires considerable analysis of data performed on powerful computers. Graham Williams explains why the cloud offers a cost-effective platform for developers and demonstrates this capability with actual applications.

Matt Winkler is a principal group program manager in the Data group at Microsoft, where he leads a program management team building services and tools for developers to build intelligent apps using cognitive APIs, the Bot Framework, and the Cortana Intelligence Suite. Matt has worked at Microsoft for the last 10 years as an evangelist and a program manager working on the .NET Framework, Visual Studio, and Azure Web Sites. As part of the Microsoft Big Data team, Matt led a PM team building HDInsight, Microsoft’s managed Hadoop and Spark service and Azure data lake analytics. Matt holds a bachelor of science in mathematics and computer science from Denison University and an MBA from Washington University in St. Louis. In his free time, Matt enjoys skiing, hiking, and woodworking.

Presentations

Bots as the next UX: Expanding your apps with conversation Session

Matt Winkler and Jennifer Marsman explain how to easily extend your apps and services with bots to reach users where they are—in messaging apps—covering use cases and case studies, how to quickly get started building a bot, how to process input using linguistic analysis, and how to deploy and integrate bots with messaging apps.

Hee Sun Won is a principal researcher at the Electronic and Telecommunications Research Institute (ETRI) and leads the Collaborative Analytics Platform for BDaaS (big data as a service) and analytics for the Network Management System (NFV/SDN/cloud). Her research interests include multitenant systems, cloud resource management, and big data analysis.

Presentations

Unified metadata management for scalability, integrity, and reliability across geographically distributed data centers Session

Minh Chau Nguyen and Hee Sun Won demonstrate how all metadata from system, service, and user can be managed in one unified platform across many geographically distributed data centers by extending the overall architecture of the Hadoop ecosystem so that multiple tenants and authorized third parties can securely access and modify the metadata in runtime via a so-called metadatabase.

Qiaoliang Xiang is currently the head of data science at ShopBack, where he focuses on setting up big data infrastructure to store and process data; building data pipelines to provide clean, accurate, and consistent data; creating self-service reporting tools to satisfy other teams’ data requests; and developing data science products to serve customers better. Previously, he was a data scientist at Lazada working on product attribute extraction, a data engineer at Visa analyzing financial transactions, and a research assistant at NUS and NTU focusing on information retrieval, machine learning, and natural language processing. Qiaoliang holds an MEng from Nanyang Technological University, Singapore.

Presentations

Crawling and tracking millions of ecommerce products at scale Session

ShopBack, a company that gives cash back to customers for successful transactions across various lifestyle categories, crawls 25 million products from multiple ecommerce websites to provide a smooth customer experience. Qiaoliang Xiang walks you through how to crawl and update products, how to scale with big data tools, and how to design a modularized system.

Ei Yamaguchi is a system infrastructure engineer and leads the OSS professional service team at NTT DATA Corporation, where he is responsible for introducing Hadoop, Spark, and other OSS middleware into enterprise systems. Previously, Ei led system development in the financial industry and developed an enterprise Hadoop cluster and machine-learning applications.

Presentations

Integrated data analytics for consumer electronics using Hadoop and Spark MLlib Session

Takayuki Nishikawa and Ei Yamaguchi explain how Panasonic developed an integrated data analytics platform to analyze the growing volume of home appliance logs from its IoT products, achieving scalability to millions of households and a 10x improvement in processing time with Hadoop and Hive, in the process gaining more reliable knowledge about users’ lifestyles with Spark MLlib.

Eugene Yan is a data scientist at Lazada, where he works on product and user problems to improve the online shopping experience for consumers and sellers across Southeast Asia. Passionate about and experienced in using data to build data products and create positive impact, he is proficient in research design, data preparation, feature engineering, machine learning, ensembling, validation, and A/B testing. He is also familiar with Python, Scala, Spark, R, SQL, Elastic, AWS, and software engineering and production practices.

Presentations

How Lazada ranks products to improve customer experience and increase conversion Session

As the number of products on Lazada grows exponentially, helping customers find relevant, quality products is key to customer experience. Eugene Yan shares how Lazada ranks products on its website, covering how Lazada scales data pipelines to collect user-behavioral data, cleans and prepares data, creates simple features, builds models to meet key objectives, and measures outcomes.

Shao Wei Ying is the COO of DataSpark, where he is responsible for leading efforts to harness the big data insights from telco networks using advanced geospatial analytics and producing deep profiling insights for enterprises and government agencies. Shao Wei is currently involved in expanding DataSpark services regionally from its Singapore base.

Presentations

Mobility as a vital sign of people and the economy Session

Shao Wei Ying explains how mobility intelligence derived from telco big data informs us about the state of our urban infrastructure, economic activities, and public safety.

Zia Zaman is MetLife’s chief innovation officer for the Asia region, where he is responsible for leading LumenLab, the industry-first innovation center, and steering the innovation agenda for the Asia region. Zia is also a member of MetLife’s Asia leadership group. Previously, Zia was the chief strategy officer and vice president of emerging businesses for SingTel’s group enterprise, where he co-led the acquisition of Amobee, jumpstarting SingTel’s entry into the mobile advertising market. Prior to SingTel, Zia was the chief strategy officer for LG Electronics North America; the chief marketing officer at FAST, where he successfully positioned FAST as a leader in the enterprise search space, culminating in a US$1.2B acquisition by Microsoft; and head of the North American strategy consulting practice at Gartner. Early in his career, Zia was a member of the M&A team at Sun Microsystems. Zia has held a number of board positions at tech startups—he was a board member on Viking and currently sits on the board of the Energy Market Authority of Singapore. Zia holds an MBA from Stanford’s GSB as well as a bachelor of science in electrical engineering and a master of science in operations research, both from MIT. Zia is married with two children and lives in Singapore.

Presentations

Disruption in insurance: Seven predictions Keynote

Zia Zaman shares seven predictions about the future of insurance.

San Zaw is the head of pre-sales in Asia for TIBCO. Based in Singapore, San leads the regional solutions consultant team and is responsible for business growth in the pre-sales organization in the region. He is a practitioner in contextual mobility and digital services and works with Asia’s leading financial services institutions and communications service providers to implement game-changing solutions and deliver differentiated customer experiences. A seasoned veteran in the telecommunications industry, San is a regular speaker and thought leader in financial and mobility innovation circles. His interests include helping enterprises monetize their social ecosystems, exposing businesses to the API economy, and advocating self-service platforms for merchants and small business owners. San brings several years of experience in the field of infocomm technology and has built a track record of solving complex business challenges for enterprises ranging from transportation and logistics, healthcare, and gaming to defense and statutory bodies. During his career at TIBCO, he has helped spearhead the development of the business in emerging markets.

Presentations

Augmenting intelligence in an interconnected world Session

The interconnected world presents unprecedented opportunities to gain new insights on behavior, both human and nonhuman alike. Likewise, it also poses unprecedented challenges on how organizations can act on these moments of opportunities in time. Michael O'Connell and San Zaw share real-world case studies demonstrating how real-time analytics solves these challenges.

Weidong Zhang is an engineering manager on the Data Analytics Infrastructure team at LinkedIn and leads the marketing and customer-service data warehouse vertical. Weidong has a passion for analytics, research, and data-driven decision making. He spent 10+ years in the data warehouse ETL and BI reporting fields and leverages his knowledge with business intelligence and Hadoop’s massive data-processing capability to address business needs. Weidong earned his PhD in computation fluid dynamics.

Presentations

Understanding the voice of members via text mining: How Linkedin built a text analytics engine at scale Session

Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang explain how LinkedIn has built a "voice of member" platform to analyze hundreds of millions of text documents. Chi-Yi, Weidong, and Yongzheng illustrate the critical components of this platform and showcase how LinkedIn leverages it to derive insights such as customer value propositions from an enormous amount of unstructured data.

Yongzheng Zhang is a business analytics manager at LinkedIn and an active researcher and practitioner of text mining and machine learning. He has developed many practical and scalable solutions for utilizing unstructured data for ecommerce and social-networking applications, including search, merchandising, social commerce, and customer-service excellence. Yongzheng is a highly regarded expert in text mining and has published and presented many papers in top journals and at conferences. He is also actively organizing tutorials and workshops on sentiment analysis at prestigious conferences. He holds a PhD in computer science from Dalhousie University in Canada.

Presentations

Understanding the voice of members via text mining: How Linkedin built a text analytics engine at scale Session

Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang explain how LinkedIn has built a "voice of member" platform to analyze hundreds of millions of text documents. Chi-Yi, Weidong, and Yongzheng illustrate the critical components of this platform and showcase how LinkedIn leverages it to derive insights such as customer value propositions from an enormous amount of unstructured data.

Imron Zuhri is the founder and chief technical director at Mediatrac, where he is responsible for herding the company’s pack of nerds, data scientists, and data engineers. Together with his wife, Imron also established the Erudio School of Art, the only democratic arts high school in Indonesia. He has wide interests in math, physics, astronomy, movies, music, photography, and literature, but first and foremost he is obsessed with understanding human behavior, perhaps to compensate for his lack of social interaction.

Presentations

Using big data technology to solve data connectivity in a disconnected world Session

Mediatrac, a big data technology platform focused on data connectivity, object profiling, and knowledge discovery, allows businesses and startups to build advanced analytic solutions on top of it. Imron Zuhri shares several data connectivity use cases and explains how to leverage distributed computing to tackle massive entity recognition and resolution problems.