Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Strata + Hadoop World 2016 Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.


Jeremy Achin is a data scientist turned entrepreneur with a vision. In his current role as CEO of DataRobot, Jeremy sets the direction of the company, products, and culture. He’s considered a data science thought leader and spends significant time visiting the world’s largest organizations to speak about practicing data science at scale. Previously, Jeremy was director of research and modeling at Travelers Insurance, where he built predictive models for pricing, retention, conversion, elasticity, lifetime value, customer behavior, claims, and much more. A true data science enthusiast, Jeremy even spends his spare time building predictive models, usually on the data science competition platform Kaggle.com. Jeremy studied math, physics, computer science, and statistics at the University of Massachusetts, Lowell.

Presentations

Data science for executives Session

In today's world, executives need to be the drivers of data science solutions. Data analysis has moved from the domain of data scientists to the forefront of core strategic initiatives. Are you empowering your team to identify and execute on every opportunity to optimize your business with machine learning? In this session, you will learn how executives are transforming their businesses with machine learning.

Sherri Adame is an enterprise metadata management leader at Cigna, where she manages the enterprise-wide rollout of metadata management practices for Cigna’s data lake through agilely delivered programs and projects. Sherri brings deep experience building high-performance information management teams across different industries, including retail, finance, and healthcare. In 2011, Information Management magazine named her one of its 25 Top Information Managers and recognized her as one of the movers, shakers, and game changers making information work for business. Sherri has spoken at many MDM and data governance summits. Her strong communication and relationship-building skills enable her to effectively sell the promise of her organization to customers and senior management and to inspire her team.

Presentations

Governance and metadata management of Cigna's enterprise data lake Session

Launched in late 2015, Cigna's enterprise data lake project is taking the company on a data governance journey. Sherri Adame offers an overview of the project, providing insights into some of the business pain points and key drivers, how it has led to organizational change, and the best practices associated with Cigna’s new data governance process.

Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads the internal data processing teams responsible for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Ask me anything: Stream processing with Apache Beam and Google Cloud Dataflow engineers AMA

Join Apache Beam and Google Cloud Dataflow engineers to ask all of your questions about stream processing. They'll answer everything from general streaming questions about concepts, semantics, capabilities, limitations, etc. to questions specifically related to Apache Beam, Google Cloud Dataflow, and other common streaming systems (Flink, Spark, Storm, etc.).

Learn stream processing with Apache Beam Tutorial

Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau and Jesse Anderson cover the basics of robust stream processing (windowing, watermarks, and triggers) with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow.

The evolution of massive-scale data processing Session

Tyler Akidau offers a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, comparing and contrasting systems at Google with popular open source systems in use today.

Erin Akred is a former Presidential Innovation Fellow and data and analytics specialist focusing on improving the human experience through sustainability, education, and healthy lifestyles. Her work over the past 15 years—spanning industries and academia across the public and private sectors and resulting in multiple patents and awards—illustrates the art of the possible in a world of abundant data.

Presentations

A collaboration in civic tech: Improving traffic safety nationwide Data Case Studies

The global movement Vision Zero aims to reduce traffic fatalities and severe injuries to zero. Erin Akred and Michael Dowd explore a partnership between Microsoft, a team of DataKind data scientists, government officials, and researchers that has been working to leverage newly available datasets to inform cities’ efforts nationwide to reduce traffic-related deaths and severe injuries to zero.

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred, Mauricio Vacas, and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Stephen O'Sullivan, and Julie Steele will field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and the evolving role of the CDO. Even if you don’t have a specific question, join in to hear what others are asking.

Navdeep (Nav) Alam brings more than 15 years of experience in software engineering, databases, data warehousing, analytics, architecture, and development to his role as the director of global data warehousing at IMS Health, where he is charged with managing the global data warehousing organization as a center of excellence and defining and executing its future roadmap, which includes next-generation massively parallel processing (MPP), low-latency data warehousing systems. Nav is also a graduate teaching assistant at Boston University, where he assists in teaching graduate courses on enterprise computing, advanced databases, data mining, and business intelligence.

Previously, Nav was the director of analytics and prediction for Empirix, where he led a global team in the architecture and development of its next-generation analytics platform, IntelliSight; director of data architecture for Mzinga’s social intelligence applications as part of its OmniSocial SaaS platform; principal software engineer for KnowledgePlanet, Mzinga’s predecessor, where he was the principal architect and developer of Firefly Simulation Developer; in the Information Technology and Application Support group at Calgary’s Nova Chemical Research and Technology Center, where he provided Y2K support and developed an interactive laboratory information management system web tutorial application; and a Unix administrator managing the Oracle data systems for Syncrude Canada and EI Processing, building filtering algorithms to scrub noise for seismic data processing. Nav holds an MS in computer science from Boston University and a bachelor’s degree in computer science from the University of Calgary.

Presentations

How the largest US healthcare dataset in Hadoop enables patient-level analytics in near real time Session

The need to find efficiencies in healthcare is becoming paramount as the global population continues to grow and live longer. Navdeep Alam shares his experience and reviews current and emerging technologies in the marketplace for working with unbounded, de-identified patient datasets of billions of rows in an efficient and scalable way.

Sridhar Alla is the director of big data solutions and architecture at Comcast, where he has delivered several key solutions, such as the Xfinity personalization platform, clickthrough analytics, and the correlation platform. Sridhar started his career working on network appliances, NAS, and caching technologies. Previously, he served as the CTO of security company eIQNetworks, where he merged the concepts of big data and security products. He holds patents on very large-scale processing algorithms and caching.

Presentations

Powering real-time analytics on Xfinity using Kudu Session

Sridhar Alla and Kiran Muglurmath explain how real-time analytics on Comcast Xfinity set-top boxes (STBs) help drive several customer-facing and internal data-science-oriented applications and how Comcast uses Kudu to fill the gaps in batch and real-time storage and computation needs, allowing Comcast to process this high-speed data without the elaborate solutions needed until now.

Kyle Ambert is lead data scientist at Intel’s Artificial Intelligence and Analytics Solutions group, where he uses machine learning and statistical methods to solve real-world big data problems. Currently, his research centers around novel applications of machine learning in the health and life sciences. Kyle contributes to the data science direction of the Trusted Analytics Platform, particularly as it pertains to analytical pipeline and algorithm development. He holds a BA in biological psychology from Wheaton College and a PhD in biomedical informatics from Oregon Health & Science University, where his research focused on text analytics and developing machine-learning optimization solutions for biocuration workflows in the neurosciences.

Presentations

Create advanced analytic models with open source Session

Creating production-ready analytical pipelines can be a messy, error-prone undertaking. Kyle Ambert explores the Trusted Analytics Platform, an open source-based platform that enables data scientists to ask bigger questions of their data and carry out principled data science experiments—all while engaging in iterative, collaborative development of production solutions with application developers.

Khaled Ammar is an expert in big data distributed systems with industrial experience in data analysis for banking and politics. He is currently a data scientist at the Thomson Reuters Innovation lab in Waterloo, Canada. He is also a PhD student in the Data Systems group at the University of Waterloo’s David R. Cheriton School of Computer Science.

Presentations

A data-driven approach to the US presidential election Session

Amir Hajian, Khaled Ammar, and Alex Constandache offer an approach to mining a large dataset to predict the electability of hypothetical candidates in the US presidential election race, using machine learning, natural language processing, and deep learning on an infrastructure that includes Spark and Elasticsearch, which serves as the backbone of the mobile game White House Run.

Alasdair Anderson is the executive vice president and head of data engineering at Nordea Bank in Copenhagen, where he leads Nordea Data Technology, which is responsible for ensuring that the bank’s data platforms supply timely, accurate, and trusted data to internal and external subscribers in order to support all of the bank’s core functions. Alasdair is also responsible for the delivery of the bank’s Financial Crime Intelligence and Analytics platform. Alasdair speaks frequently throughout Europe on the topics of data management, analytics, and innovation and represents Nordea on multiple customer advisory boards of the data and analytics vendors with which Nordea partners.

Presentations

Why is this disruption different from all other disruptions? Hadoop as a game changer in financial services Session

What's the point at which Hadoop tips from a Swiss Army knife of use cases to a new foundation that rearranges how the financial services marketplace turns data into profit and competitive advantage? This panel of expert practitioners looks into the near future to see if the inflection point is at hand.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students, at organizations ranging from startups to Fortune 100 companies, the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Learn stream processing with Apache Beam Tutorial

Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau and Jesse Anderson cover the basics of robust stream processing (windowing, watermarks, and triggers) with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow.

Spark and Java: Yes, they work together Session

Although Spark gets a lot of attention, we tend to think of only two languages as being supported—Python and Scala. Jesse Anderson proves that Java works just as well. With lambdas, we even get syntax comparable to Scala’s, so Java developers get the best of both worlds without having to learn Scala.
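As a taste of what the session covers, here is a minimal word count written against Spark’s Java API (a sketch using Spark 2.x signatures; the HDFS paths are hypothetical placeholders, not part of the session materials):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JavaWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("JavaWordCount");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // With Java 8 lambdas, the pipeline reads much like its Scala equivalent.
      JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt"); // hypothetical path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey((a, b) -> a + b);
      counts.saveAsTextFile("hdfs:///tmp/output"); // hypothetical path
    }
  }
}
```

Without lambdas, each of those three transformations would require an anonymous inner class, which is where Java’s reputation for Spark verbosity came from.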

Presentations

Five ways to modernize your BI tools and make them work on more data Session

More data exists than ever before and in more disparate silos. Getting the insights you need, sifting through data, and answering new questions have all been complex, hairy tasks that only data jocks have been able to do. Andrew Yeung and Scott Anderson explore new ways to challenge the status quo and speed insights on diverse sources and demonstrate real customer use cases.

June Andrews is a principal data scientist at Wise/GE Digital working on a machine learning and data science platform for the Industrial Internet of Things, which includes aviation, trains, and power plants. Previously, she worked at Pinterest spearheading the Data Trustworthiness and Signals Program to create a healthy data ecosystem for machine learning. She has also led efforts at LinkedIn on growth, engagement, and social network analysis to increase economic opportunity for professionals. June holds degrees in applied mathematics, computer science, and electrical engineering from UC Berkeley and Cornell.

Presentations

Iterative supervised clustering: A dance between data scientists and machine learning Session

Clustering algorithms produce vectors of information, which are almost surely difficult to interpret. These are then laboriously translated by data scientists into insights for influencing product and executive decisions. June Andrews offers an overview of a human-in-the-loop method used at Pinterest and LinkedIn that has led to fast, accurate, and pertinent human-readable insights.

Amitai Armon is the chief data scientist for Intel’s Advanced Analytics group, which provides solutions for the company’s challenges in diverse domains ranging from design and manufacturing to sales and marketing, using machine learning and big data techniques. Previously, Amitai was the cofounder and director of research at TaKaDu, a provider of water-network analytics software to detect hidden underground leaks and network inefficiencies. The company received several international awards, including the World Economic Forum Technology Pioneers award. Amitai has about 15 years of experience in performing and leading data science work. He holds a PhD in computer science from Tel Aviv University in Israel, where he previously completed his BSc (cum laude, at the age of 18).

Presentations

Fast deep learning at your fingertips Session

Amitai Armon and Nir Lotan outline a new, free software tool that enables the creation of deep learning models quickly and easily. The tool is based on existing deep learning frameworks and incorporates extensive optimizations that provide high performance on standard CPUs.

Amar Arsikere is the founder and CEO of Infoworks.io. Previously, Amar built several large-scale data systems on Bigtable and Hadoop at Google and Zynga. At Zynga, he also led the design and deployment of the gaming database, then the largest in-memory database in the world. At Google, he pioneered the development of a data warehousing platform on Bigtable. Amar is a recipient of the InfoVision award from the IEC and the Jars Top 25 award and holds several patents in the field of software and Internet technologies.

Presentations

Data warehouse augmentation and modernization using Hadoop Session

Current data warehouse technologies are increasingly challenged to handle the growth in data volume, new data types, and multiple analytics types. Hadoop has the potential to address these issues, but you need to solve several complexities before you can realize its full benefits. Amar Arsikere showcases the business and technical aspects of augmenting and modernizing data warehouses on Hadoop.

Caitlin Augustin is a data ambassador within DataKind’s DataCorps, a pro bono team of data scientists working with social change organizations to transform their work and their sector. When not at Civic Hall working on these exciting projects, Caitlin can be found juggling her day job as a research scientist at Kaplan Test Prep and finishing her PhD in environmental sciences through the Abess Center for Ecosystem Science and Policy. She’s passionate about international development and environmental policy and has participated in numerous conferences such as the American Geophysical Union, Engineers Without Borders, and the American Meteorological Society’s policy colloquium. She holds a BS in industrial engineering from the University of Miami and is currently a PhD candidate at the same university. Go ‘Canes!

Presentations

Adventures from the frontlines of data for good Session

JeanCarlo Bonilla, Susan Sun, and Caitlin Augustin explore how DataKind volunteer teams navigate the road to social impact by automating evidence collection for conservationists and helping expand the reach of mobile surveys so that more voices can be heard.

Ghazal Badiozamani is vice president of corporate strategy at Elsevier, where she is helping transform the company from a traditional publisher (owner of 20% of the world’s academic journals) to a data-driven, digital decision support provider for the medical and scientific community. Prior to Elsevier, Ghazal worked with partners at Kleiner Perkins Caufield & Byers to incubate two companies. In a prior life, she spent almost a decade as a program officer at the United Nations, mediating global environmental negotiations. Ghazal holds an undergraduate degree from Stanford, an MSc from the London School of Economics, and an MBA from Wharton.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

From static content to adaptive learning Data Case Studies

Transforming a mature market requires a radical departure from business as usual. Ghazal Badiozamani explores how global STEM information company Elsevier developed a completely novel approach to delivering learning content by coming together as a highly collaborative team, becoming obsessed with customer feedback, and taking a data-driven approach to delivery.

Richard Baumgartner is a senior principal scientist with the biometrics research department of Biostatistics and Research Decision Sciences (BARDS) at Merck and Co. While at Merck, Richard has been supporting early clinical and preclinical studies with imaging components, including functional magnetic resonance imaging (fMRI), dynamic contrast-enhanced MRI (DCE-MRI), and positron emission tomography (PET) imaging for neuroscience, inflammation, and cardiovascular therapeutic areas. Previously, he was an associate research officer with the Institute for Biodiagnostics at the National Research Council Canada in Winnipeg, where he pioneered development of methods for exploratory data analysis of fMRI and worked on machine-learning applications to develop diagnostic biomarkers for prediction of pathogenic fungi and breast cancer.

Presentations

Cold chain analytics: Using Revolution R and the Hadoop ecosystem Data Case Studies

Nitin Kaul and Richard Baumgartner demonstrate how Merck applies descriptive, predictive, and prescriptive analytics, leveraging parallel distributed libraries and the predictive modeling capabilities of Revolution R deployed on a secure Hadoop cluster, to identify the factors behind product temperature excursions and to predict and prevent future excursions in product shipments.

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Maxime Beauchemin recently joined Airbnb as a data engineer developing tools to help streamline and automate data-engineering processes. He mastered his data-warehousing fundamentals at Ubisoft and was an early adopter of Hadoop/Pig while at Yahoo in 2007. More recently, at Facebook, he developed analytics-as-a-service frameworks around engagement and growth-metrics computation, anomaly detection, and cohort analysis. He’s a father of three, and in his free time, he’s a digital artist. You can read more about his projects on his blog, Digital Artifacts.

Presentations

Caravel: An open source data exploration and visualization platform Session

Airbnb developed Caravel to provide all employees with interactive access to data while minimizing friction. Caravel's main goal is to make it easy to slice, dice, and visualize data. Maxime Beauchemin explains how Caravel empowers each and every employee to perform analytics at the speed of thought.

Marie Beaugureau is the lead data editor for O’Reilly Media.

Presentations

Data 101 Tutorial

Data 101 introduces you to core principles of data architecture, teaches you how to build and manage successful data teams, and inspires you to do more with your data through real-world applications. Setting the foundation for deeper dives on the following days of Strata + Hadoop World, Data 101 reinforces data fundamentals and helps you focus on how data can solve your business problems.

Roy Ben-Alta is the big data analytics business development manager at Amazon Web Services, where he works with AWS customers in building data-driven products, whether batch or real time, and creating analytics solutions in the cloud. Roy has worked in the data and analytics industry for over a decade and has helped hundreds of customers bring compelling data-driven products to the market.

Presentations

Amazon Kinesis: Real-time streaming data in the AWS cloud Session

Roy Ben-Alta explores the Amazon Kinesis platform in detail and discusses best practices for scaling your core streaming data ingestion pipeline as well as real-world customer use cases and design pattern integration with Amazon Elasticsearch, AWS Lambda, and Apache Spark.
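For context, the heart of a Kinesis ingestion pipeline is a single putRecord call against a stream. The sketch below uses the AWS SDK for Java; the stream name, partition key, and payload are hypothetical placeholders, and production pipelines typically batch writes (putRecords) or use the Kinesis Producer Library instead:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class KinesisIngest {
  public static void main(String[] args) {
    // Builds a client from the default credential and region provider chain.
    AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

    PutRecordRequest request = new PutRecordRequest()
        .withStreamName("clickstream-events")  // hypothetical stream name
        .withPartitionKey("user-42")           // determines which shard receives the record
        .withData(ByteBuffer.wrap(
            "{\"event\":\"page_view\"}".getBytes(StandardCharsets.UTF_8)));

    PutRecordResult result = kinesis.putRecord(request);
    System.out.println("Stored in shard " + result.getShardId());
  }
}
```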

Deborah Berebichez is a physicist, data scientist, and cohost of Discovery Channel’s TV show Outrageous Acts of Science, where she uses her physics background to explain the science behind extraordinary engineering feats. Deborah has also appeared as an expert on the Travel Channel, NOVA, CNN, FOX, MSNBC, and numerous international media outlets. Deborah is currently the chief data scientist at Metis, where she leads the creation and growth of exceptional data science training opportunities, ensuring the excellence of Metis’s data science bootcamps, corporate training, professional development, and online programs. She is an active contributor to the national data science ecosystem through frequent public speaking and presentations on panels at data science conferences.

Deborah’s work in science education and outreach has been recognized by the Wall Street Journal, Oprah, The Dr. Oz Show, TED, DLD, Wired, Ciudad de las Ideas, and others. Her passion is to empower young people to learn science and to improve the state of STEM education in the world. She is a John C. Whitehead Fellow at the Foreign Policy Association, a winner of the Society of SHPE’s STAR Award, and a recipient for Top Latina Tech Blogger by the Association of Latinos in Social Media LATISM. Deborah is the first Mexican woman to graduate with a physics PhD from Stanford University, and she completed postdoctoral fellowships at Columbia University’s Applied Math and Physics Department and at NYU’s Courant Institute for Mathematical Sciences, where she carried out research in the area of acoustic waves. She invented a highly effective technique in the field of wireless communications whereby a cell phone user can communicate with a desired target user in a location far away.

Presentations

Statistics and the art of deception Data 101

Data scientists use statistics to reach meaningful conclusions about data. Unfortunately, statistical tools are often misapplied, resulting in errors that cost both time and money. Deborah Berebichez presents examples of egregious misuses of statistics in business, technology, science, and the media and outlines the simple steps that can reduce the chance of being fooled by statistics.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences internationally and in the United States. He is the copresenter of various O’Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Certification for Apache Cassandra: Get trained, get certified, get better paid Training

O’Reilly Media and DataStax have partnered to create a 2-day developer certification course for Apache Cassandra. Get certified as a Cassandra developer at Strata + Hadoop World in New York and be recognized for your NoSQL expertise.

Certification for Apache Cassandra: Get trained, get certified, get better paid (Day 2) Training day 2

O’Reilly Media and DataStax have partnered to create a 2-day developer certification course for Apache Cassandra. Get certified as a Cassandra developer at Strata + Hadoop World in New York and be recognized for your NoSQL expertise.

David Beyer is currently an investor with Amplify Partners, a $50M VC firm focused exclusively on early-stage IT infrastructure and data companies. David began his career in technology as the cofounder and CEO of Chartio.com, a pioneering provider of cloud-based data visualization and analytics. He was part of the founding team at Patients Know Best, one of the world’s leading cloud-based personal health record (PHR) companies. David has also been a prolific investor and advisor to entrepreneurs. He has angel invested in over 35 early-stage companies, including Tracelytics (acquired by AppNeta), Teambox, Modria, ReTargeter, and Teespring.

Presentations

Machine intelligence in the wild: How AI will reshape global industries Session

Society is standing at the gates of what promises to be a profound transformation in the nature of work, the role of data, and the future of the world's major industries. Intelligent machines will play a variety of roles in every sector of the economy. David Beyer explores a number of key industries and their idiosyncratic journeys on the way to adopting AI.

Zhaojuan Bianny Bian is an engineering manager in Intel’s Software and Service Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of experience in the industry with performance analysis experience that spans big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.

Presentations

Planning your SQL-on-Hadoop cluster for a multiuser environment with heterogeneous and concurrent query workloads Session

Many challenges exist in designing an SQL-on-Hadoop cluster for production in a multiuser environment with heterogeneous and concurrent query workloads. Jun Liu and Zhaojuan Bian draw on their personal experience to address these challenges, explaining how to determine the right size of your cluster with different combinations of hardware and software resources using a simulation-based approach.

Sarah Bird is a software engineer at Continuum Analytics. She has been a core Bokeh developer since 2015 and has given numerous talks and tutorials on Bokeh. Previously, she worked at Aptivate as a full stack web developer building IT solutions for the international development sector. She has worked in a variety of sectors, from systems engineering for ejection seats to mobile health and data collection in Pakistan. Sarah holds a master’s degree in mechanical engineering from Cambridge University and a master of science in technology and policy from the Massachusetts Institute of Technology.

Presentations

Interactive data applications in Python Tutorial

Bryan Van de Ven and Sarah Bird demonstrate how to build intelligent apps in a week with Bokeh, Python, and optimization.

Johan Bjerke is a senior sales engineer at Splunk. Johan has a decade of experience in IT and data management, working both in startups and with Splunk. He has had an impressive career at Splunk, being named global technical rookie in his first year and sales engineer of the year in the UK the following year. He is an active contributor to the Splunk community and has created one of the most popular Splunk apps, the Splunk App for Web Analytics. Johan is a CISSP and holds an MSc in industrial engineering from Lund University.

Presentations

From data to insights using analytics Session

Machine data is growing at an exponential rate, and a key driver for this growth is the Internet of Things (IoT) revolution. Johan Bjerke explains how to find value in and make use of the unstructured machine data that plays an important role in the new connected world.

Ryan Blue is an engineer on Netflix’s Big Data Platform team. Before Netflix, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is also the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.

Presentations

Parquet performance tuning: The missing guide Session

Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet's features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he's learned, creating the missing guide you need.
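To give a flavor of the knobs involved, the sketch below sets a few standard Parquet configuration keys and writer options from Spark’s Java API. The keys are real Parquet/Hadoop settings, but the values are illustrative starting points and the paths are hypothetical; they are not Netflix’s recommendations from the talk:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetTuning {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("ParquetTuning").getOrCreate();

    // Row-group (block) and page sizes govern the scan/memory trade-off;
    // these values are illustrative, not tuned recommendations.
    spark.sparkContext().hadoopConfiguration().setInt("parquet.block.size", 128 * 1024 * 1024);
    spark.sparkContext().hadoopConfiguration().setInt("parquet.page.size", 1024 * 1024);
    spark.sparkContext().hadoopConfiguration().setBoolean("parquet.enable.dictionary", true);

    Dataset<Row> events = spark.read().json("hdfs:///tmp/events.json"); // hypothetical input
    events.sort("event_type") // clustering by a commonly filtered column improves min/max pruning
          .write()
          .option("compression", "snappy")
          .parquet("hdfs:///tmp/events_parquet"); // hypothetical output
  }
}
```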

Ron Bodkin is technical director for applied artificial intelligence at Google, where he helps Global Fortune 500 enterprises unlock strategic value with AI, acts as executive sponsor for Google product and engineering teams to deliver value from AI solutions, and leads strategic initiatives working with customers and partners. Previously, Ron was vice president and general manager of artificial intelligence at Teradata; the founding CEO of Think Big Analytics (acquired by Teradata in 2014), which provides end-to-end support for enterprise big data, including data science, data engineering, advisory and managed services, and frameworks such as Kylo for enterprise data lakes; vice president of engineering at Quantcast, where he led the data science and engineer teams that pioneered the use of Hadoop and NoSQL for batch and real-time decision making; founder of enterprise consulting firm New Aspects; and cofounder and CTO of B2B applications provider C-Bridge. Ron holds a BS in math and computer science with honors from McGill University and a master’s degree in computer science from MIT.

Presentations

Driving open source adoption within the enterprise Keynote

There’s been much discussion on open source versus commercial; CIOs and CTOs are increasingly interested in solutions that blend the benefits of both worlds. Ron Bodkin explains how Teradata drives open source adoption inside enterprises through a range of initiatives: direct contributions to open source projects, building orchestration software, and providing technical expertise.

Michelle Bonat is passionate about leveraging data and technology to drive business value. Michelle is currently the CEO and cofounder of Data Simply, a Silicon Valley-based software company that is redefining the way financial professionals get insights from data. A software engineer, Michelle is a former executive at Oracle, where she ran financial web application development globally. She is also an ex-banker with an MBA from Kellogg. As a result, she deeply understands the challenges of financial professionals and knows how to solve them through technology.

Presentations

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Open the black box: An executive guide to making unstructured data work in finance FinData

You need to think about financial data differently to solve the most pressing challenges for your organization. But how do you get to data-driven finance and risk using unstructured data? Michelle Bonat explains how unstructured data (words and text) can be used to solve critical challenges in finance and unlock opportunities.

Presentations

Adventures from the frontlines of data for good Session

JeanCarlo Bonilla, Susan Sun, and Caitlin Augustin explore how DataKind volunteer teams navigate the road to social impact by automating evidence collection for conservationists and helping expand the reach of mobile surveys so that more voices can be heard.

Alex Bordei has been developing infrastructure products for over nine years. Before becoming Bigstep’s product manager, he was one of the core developers of Hostway Corporation’s provisioning platform. He then focused on defining and developing products for Hostway’s EMEA market and was one of the pioneers of virtualization in the company. After successfully launching two public clouds based on VMware software, he created the first prototype of Bigstep’s Full Metal Cloud in 2011. He now focuses on guaranteeing that the Full Metal Cloud is the highest-performance cloud in the world for big data applications. Twitter: @alexandrubordei

Presentations

Building data lakes in the cloud Session

Alex Bordei walks you through the steps required to build a data lake in the cloud and connect it to on-premises environments, covering best practices in architecting cloud data lakes and key aspects such as performance, security, data lineage, and data maintenance. The technologies presented range from basic HDFS storage to real-time processing with Spark Streaming.

Adam Bordelon is a distributed systems engineer at Mesosphere and an Apache Mesos committer. Before joining Mesosphere, Adam was lead developer on the Hadoop core team at MapR Technologies, developed distributed systems for personalized recommendations at Amazon, and rearchitected the LabVIEW compiler at National Instruments. He holds a master’s degree from Rice University, where he built a tool to analyze supercomputer performance data for bottlenecks and anomalies.

Presentations

Elastic data services on Mesos via Mesosphere’s DC/OS Session

Adam Bordelon and Mohit Soni demonstrate how projects like Apache Myriad (incubating) can install Hadoop on Mesosphere DC/OS alongside other data center-scale applications, enabling efficient resource sharing and isolation across a variety of distributed applications running on the same cluster and hence breaking silos.

Vinayak Borkar is the CTO of X15 Software, Inc. Previously, he was a PhD candidate at UC Irvine, where he worked on big data and contributed to the Hyracks Open Source Big Data Project. Prior to going back to school, Vinayak spent eight years building data-management software.

Presentations

Rethinking operational data stores on Hadoop Session

Starting from first principles, Vinayak Borkar defines the requirements for a modern operational data store and explores some possible architectures to support those requirements.

David Boyle is passionate about helping businesses build analytics-driven decision making so they can make quicker, smarter, and bolder decisions. He leads strategy and insights at MasterClass, working with the likes of Stephen Curry, Gordon Ramsay, and Martin Scorsese to help people around the world learn from the greatest in their field. He has previously built global analytics and insight capabilities for a number of the world’s leading entertainment businesses in television (the BBC), book publishing (HarperCollins Publishers), and the music industry (EMI Music), helping to shift each organization’s decision making at all levels, from content investment to product and brand development. He builds on experience developing analytics for global retailers as well as for political campaigns in the US and UK, in philanthropy, and in strategy consulting.

Presentations

Catchy content: What makes TV content work? Data Case Studies

BBC Worldwide has a vast catalogue of content. David Boyle explains how data helps the BBC determine which countries a new show is best suited for—and which short-form content will be most engaging in promoting those shows—as he shares successes, failures, and frustrations from the BBC's latest work using predictive analytics, building a content genome, quant research, and social media monitoring.

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Semantic natural language understanding with Spark Streaming, UIMA, and machine-learned ontologies Session

David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.

Presentations

Making real-time analytics on the data lake a reality Session

Data lakes provide large-scale data processing and storage at low cost but struggle to deliver real-time analytics without investment in large clusters. If you need subsecond analytic response on streaming data, consider a GPU database. Amit Vij and Mark Brooks outline the dramatic performance benefits a GPU database offers and explain how to integrate it with Hadoop.

Kurt Brown leads the Data Platform team at Netflix. His group architects and manages the technical infrastructure underpinning the company’s analytics. The Netflix data platform includes various big data technologies (e.g., Hadoop, Spark, and Presto), Netflix open sourced applications and services (e.g., Genie and Lipstick), and traditional BI tools (e.g., Tableau and Teradata).

Presentations

Office Hour with Kurt Brown (Netflix) Office Hours

Ask Kurt about Netflix’s data platform and what the future holds.

The Netflix data platform: Now and in the future Session

The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.

Einat Burshtine is the head of the data and analytics organization for infrastructure within Credit Suisse, where she is responsible for strategy and implementation of all data-related products for infrastructure. Einat also leads big data initiatives, log management, CMDB, data governance, and visualization tools. Previously, she headed the infrastructure architecture team in the US, developing the firm’s strategy for infrastructure, including end user experience monitoring, systems management, and big data analytics. Einat has 18 years of experience across broad functional areas within the IT industry. Prior to joining Credit Suisse, she worked at Oracle in New York managing the business operations of the development organization in global IT and led strategic initiatives in defining, monitoring, and improving operations and processes, leading to greater efficiency and cost reduction. Prior to that, Einat managed the northeast territory of Oracle’s Premium Support Services. Einat holds a bachelor’s degree in economics and management from Tel Aviv University.

Presentations

Why is this disruption different from all other disruptions? Hadoop as a game changer in financial services Session

What's the point at which Hadoop tips from a Swiss Army knife of use cases to a new foundation that rearranges how the financial services marketplace turns data into profit and competitive advantage? This panel of expert practitioners looks into the near future to see if the inflection point is at hand.

Mar Cabra is the head of the Data and Research unit at the International Consortium of Investigative Journalists, which produces the organization’s key data work and also develops tools for better collaborative investigative journalism. Mar fell in love with data while a Fulbright scholar and fellow at the Stabile Center for Investigative Journalism at Columbia University. Since then, she’s promoted data journalism in her native Spain, cocreating the first-ever master’s degree in investigative reporting, data journalism, and visualization as well as the national data journalism conference, which gathers more than 500 people every year. She previously worked in television (BBC, CNN+, and LaSexta Noticias), and her work has been featured in the International Herald Tribune, the Huffington Post, PBS, El País, El Mundo, and El Confidencial, among others. In 2012, she received the Spanish Larra Award, given to the country’s most promising journalist under 30.

Presentations

Connecting the dots through leaked and public data FinData

Offshore leaks, Lux leaks, Swiss leaks, Bahamas leaks, and the Panama Papers—all have one thing in common: they were all uncovered by the International Consortium of Investigative Journalists. Giannina Segnini and Mar Cabra explain how this global network of muckrakers uses technology to deal with big data and find cross-border stories that have worldwide impact.

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

The tech behind the biggest journalism leak in history Keynote

The Panama Papers investigation revealed the offshore holdings and connections of dozens of politicians and prominent public figures around the world and led to high-profile resignations, police raids, and official investigations. Almost 500 journalists had to sift through 2.6 terabytes of data—the biggest leak in the history of journalism. Mar Cabra explains how technology made it all possible.

Yishay Carmiel is the founder of IntelligentWire, a company that develops and implements industry-leading deep learning and AI technologies for automatic speech recognition (ASR), natural language processing (NLP) and advanced voice data extraction, and the head of Spoken Labs, the strategic artificial intelligence and machine learning research arm of Spoken Communications. Yishay and his teams are currently working on bleeding-edge innovations that make the real-time customer experience a reality—at scale. Yishay has nearly 20 years’ experience as an algorithm scientist and technology leader building large-scale machine learning algorithms and serving as a deep learning expert.

Presentations

Recent advances in applications of deep learning for text and speech Session

Deep learning has taken us a few steps further toward achieving AI for a man-machine interface. However, deep learning technologies like speech recognition and natural language processing remain a mystery to many. Yishay Carmiel reviews the history of deep learning, the impact it's made, recent breakthroughs, interesting solved and open problems, and what's in store for the future.

Jeff Carpenter is a technology evangelist at DataStax, where he leverages his background in system architecture, microservices, and Apache Cassandra to help developers and operations engineers build distributed systems that are scalable, reliable, and secure. Jeff has worked on projects ranging from a complex battle planning system in an austere network environment to a cloud-based hotel reservation system, and he is the author of Cassandra: The Definitive Guide (second edition).

Presentations

Data modeling for microservices with Cassandra and Spark Session

Jeff Carpenter describes how data modeling can be a key enabler of microservice architectures for transactional and analytics systems, including service identification, schema design, and event streaming.

Connor Carreras is Trifacta’s manager for customer success in the Americas, where she helps customers use cutting-edge data wrangling techniques in support of their big data initiatives. Connor brings her prior experience in the data integration space to help customers understand how to adopt self-service data preparation as part of an analytics process. She is a coauthor of the O’Reilly book Principles of Data Wrangling.

Presentations

Top data wrangling use cases in enterprise analytics Session

Connor Carreras offers an in-depth review of the most popular use cases for data wrangling solutions among enterprise organizations, drawing on real customer deployments to explain how data wrangling has enabled them to accelerate analysis and uncover new sources of business value.

Joe Caserta is president of Caserta Concepts, an award-winning New York-based innovation consulting and technology implementation firm specializing in big data analytics, data warehousing, business intelligence solutions, and helping clients maximize data value. A recognized big data strategy consultant, author, and educator, Joe is coauthor of the best-selling book The Data Warehouse ETL Toolkit (Wiley, 2004), a contributor to industry publications, and frequent keynote speaker and expert panelist at industry conferences and events. He also serves on the advisory boards of financial and technical institutions and is the organizer and host of the Big Data Warehousing Meetup group in NYC.

Presentations

Path-to-purchase analytics using a data lake and Spark Session

Joe Caserta explores how a leading membership interest group is utilizing a data lake to track its members’ path-to-purchase touch points across multiple channels by matching and mastering individuals using Spark GraphFrames and stitching together website, marketing, email, and transaction data to discover the most effective way to attract new members and retain existing high-value members.

Michael Casey is a writer and researcher in the fields of economics, finance, and digital-currency technology. He was recently named senior advisor for blockchain opportunities at the MIT Media Lab’s new Digital Currency Initiative. This follows 23 years as a journalist, the last 18 of which Michael spent at the Wall Street Journal. In a career spanning Perth, Bangkok, Jakarta, Buenos Aires, and New York, Michael has covered currencies, bonds, equities, and economic policy. Most recently, he was the Wall Street Journal’s senior columnist covering global economics and markets. Michael is the coauthor of The Age of Cryptocurrency: How Bitcoin and Digital Money Are Challenging the Global Economic Order, with his Wall Street Journal colleague Paul Vigna, and wrote the regular Wall Street Journal BitBeat column on digital currency developments. He has worked as a host of Wall Street Journal-sponsored online TV news programs and has been a frequent commentator on CNBC, the BBC, Fox Business, CNN, and a variety of other broadcast media. Michael authored two earlier books: The Unfair Trade, a book on the global dimensions of the financial crisis, and Che’s Afterlife, about the international impact of Alberto Korda’s iconic image of Che Guevara. He is a graduate of the University of Western Australia and has a master’s degree from Cornell University.

Presentations

Distributed trust for a decentralized financial future FinData

We're headed toward a decentralized economy, where our finances are managed by investment algorithms, big data analytics, IoT-linked devices, and crowdfunding marketplaces. But its potential won't be realized until we overcome a core obstacle: trust. Michael Casey explains why blockchain technology, with its decentralized trust architecture, is the platform that makes everything else possible.

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Tanya Cashorali is the founding partner of TCB Analytics, a Boston-based data consultancy. Previously, she worked as a data scientist at Biogen. Tanya started her career in bioinformatics and has applied her experience to other data-rich verticals such as telecom, finance, and sports. She brings over 10 years of experience using R in data scientist roles as well as managing and training data analysts, and she’s helped grow a handful of Boston startups.

Presentations

CANCELED: How to hire and test for data skills: A one-size-fits-all interview kit Session

Given the recent demand for data analytics and data science skills, adequately testing and qualifying candidates can be a daunting task. Interviewing hundreds of individuals of varying experience and skill levels requires a standardized approach. Tanya Cashorali explores strategies, best practices, and deceptively simple interviewing techniques for data analytics and data science candidates.

Diane Chang is a distinguished data scientist at Intuit, where she has worked on many interesting business problems that depend on machine learning, behavioral analysis, and risk prediction. Previously, Diane worked for a small “mathematical” consulting firm and a startup in the online advertising space and was a stay-at-home mom for six years. She holds a PhD in operations research from Stanford.

Presentations

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Using big data for small business financing FinData

Almost 700,000 small businesses fail each year—many because they cannot secure critical financing when they need it. Banks would lend more if they could better distinguish the good risks from the bad. Diane Chang explains how a small team used big data to turn a 70% loan rejection rate into a 70% acceptance rate and solve a critical problem for small businesses.

Joanne Chen joined Truveris as the company’s first data scientist and built its data science practice, leveraging 1.5 billion pharmacy claims. In her current role as the vice president of data science, her responsibilities include full life-cycle management of products and data-driven R&D efforts. Prior to joining Truveris, Joanne was a statistics professional at Liberty Mutual focusing on personalized and targeted distribution strategy for auto products. Joanne holds a PhD in evolutionary biology from Harvard and a master’s degree in statistics.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Precision and paragon: How holistic spend modeling drives efficiency in healthcare Data Case Studies

Joanne Chen explores how data science powers business at Truveris, a health IT startup disrupting the prescription benefits industry, and discusses Truveris's OneRx National Drug Index, the first index that provides a real-time holistic view of prescription drug prices.

Slava Chernyak is a senior software engineer at Google. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Google Cloud Dataflow Streaming from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.

Presentations

Watermarks: Time and progress in Apache Beam (incubating) and beyond Session

Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications.
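To make the concept concrete, here is a minimal sketch in the Beam Java SDK of a pipeline whose output timing is driven by the watermark. The window size, lateness bound, and firing policy are illustrative choices for this sketch, not recommendations from the session:

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WatermarkExample {
  // Counts events per key in one-minute event-time windows. The watermark
  // estimates event-time completeness: the on-time pane fires when the
  // watermark passes the end of the window, and any late data arriving
  // within the allowed lateness triggers additional refinement panes.
  static PCollection<KV<String, Long>> countPerMinute(PCollection<String> events) {
    return events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(30))
            .accumulatingFiredPanes())
        .apply(Count.<String>perElement());
  }
}
```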

Ewen Cheslack-Postava is an engineer at Confluent building a stream data platform based on Apache Kafka to help organizations reliably and robustly capture and leverage all their real-time data. Ewen received his PhD from Stanford University, where he developed Sirikata, an open source system for massive virtual environments. His dissertation defined a novel type of spatial query giving significantly improved visual fidelity and described a system for efficiently processing these queries at scale.

Presentations

Ask me anything: Apache Kafka AMA

Join Apache Kafka cocreator and PMC chair Jun Rao and Apache Kafka committer and architect of Kafka Connect Ewen Cheslack-Postava for a Q&A session about Apache Kafka. Bring your questions about Kafka internals or key considerations for developing your data pipeline and architecture, designing your applications, and running in production with Kafka.

When one data center is not enough: Building large-scale stream infrastructures across multiple data centers with Apache Kafka Session

You may have successfully made the transition from single machines and one-off solutions to large, distributed stream infrastructures in your data center. But what if one data center is not enough? Ewen Cheslack-Postava explores resilient multi-data-center architecture with Apache Kafka, sharing best practices for data replication and mirroring as well as disaster scenarios and failure handling.
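
The mechanics behind cross-data-center replication can be pictured as a consume-and-reproduce loop, which is the essence of Kafka's MirrorMaker tool. Below is a minimal sketch in that spirit using the kafka-python client; the cluster addresses, topic name, and consumer group are illustrative assumptions, and a real deployment would use MirrorMaker itself rather than hand-rolled code.

```python
# Mirror one topic from a source data center into an aggregate cluster.
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="kafka.dc-east.example.com:9092",  # hypothetical source
    group_id="mirror-east-to-aggregate",
    enable_auto_commit=False,
)
producer = KafkaProducer(
    bootstrap_servers="kafka.aggregate.example.com:9092"  # hypothetical target
)

for record in consumer:
    # Preserve the key so key-based partitioning stays stable downstream.
    producer.send("clickstream", key=record.key, value=record.value)
    producer.flush()
    consumer.commit()  # commit the offset only after a successful hand-off
```

Flushing and committing per record keeps the sketch easy to reason about at the cost of throughput; production mirroring batches both.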

Brian Clapper is a senior instructor and curriculum developer at Databricks, with more than 32 years' experience as a software developer. He has worked for a stock exchange, the US Navy, a large software company, several startups, and small companies, most recently spending 7 years as an independent consultant and trainer. Brian is fluent in many languages, including Scala, Java, Python, Ruby, C#, and C, and is highly familiar with current web application technologies, including frameworks like Play!, Ruby on Rails, and Django and frontend technologies like jQuery, EmberJS, and AngularJS. Brian founded the Philly Area Scala Enthusiasts in 2010 and, since 2011, has been a co-organizer of the Northeast Scala Symposium; he was also a co-organizer of Scalathon in 2011 and 2012. He maintains a substantial GitHub repository of open source projects and is fond of saying that even after many years as a software developer, programming is still one of his favorite activities.

Presentations

Spark foundations: Prototyping Spark use cases on Wikipedia datasets Training

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Brian Clapper employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.

Spark foundations: Prototyping Spark use cases on Wikipedia datasets (Day 2) Training day 2

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Brian Clapper employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.
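
To give a flavor of the exercise style, here is a small PySpark sketch that aggregates hourly Wikipedia pageview counts by article. The file name and its whitespace-delimited layout (project, article, views, bytes) are assumptions for illustration, not the course's actual lab data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wikipedia-pageviews").getOrCreate()

# Parse a whitespace-delimited pageviews dump: project, article, views, bytes.
views = (spark.read.text("pageviews-20160926-120000")
         .select(F.split("value", r"\s+").alias("cols"))
         .select(F.col("cols")[0].alias("project"),
                 F.col("cols")[1].alias("article"),
                 F.col("cols")[2].cast("long").alias("views")))

# Top ten English-Wikipedia articles by total views.
(views.filter(views.project == "en")
      .groupBy("article")
      .agg(F.sum("views").alias("total_views"))
      .orderBy(F.desc("total_views"))
      .show(10))
```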

Ira Cohen is a cofounder of Anodot and its chief data scientist, where he is responsible for developing and inventing its real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Analytics for large-scale time series and event data Session

Time series and event data form the basis for real-time insights about the performance of businesses such as ecommerce, the IoT, and web services, but gaining these insights involves designing a learning system that scales to millions and billions of data streams. Ira Cohen outlines a system that performs real-time machine learning and analytics on streams at massive scale.
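
As a point of contrast for the scale and sophistication the session describes, the simplest possible baseline for a single stream is a rolling z-score, sketched below. The window, threshold, and synthetic signal are arbitrary choices, and Anodot's actual multivariate algorithms go far beyond this toy.

```python
import statistics

def rolling_zscore_anomalies(series, window=20, threshold=3.0):
    """Flag points more than `threshold` deviations from a rolling mean."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = statistics.mean(history)
        sigma = statistics.stdev(history) or 1e-9  # guard against zero variance
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A noisy-but-stable signal with one obvious spike at index 50.
signal = [10 + 0.1 * (-1) ** i for i in range(50)] + [55.0] + [10.0] * 20
print(rolling_zscore_anomalies(signal))  # -> [50]
```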

Alex Constandache is a senior data scientist in the Thomson Reuters Lab in Boston. Alex is a physicist by training and did his doctoral thesis research in nonlinear dynamics at the University of Rochester. After a brief academic stint working in computational fluid dynamics and teaching astrophysics at the University of Virginia, he became a software engineer working in the areas of machine learning and information retrieval. Before joining Thomson Reuters, Alex worked in the Search and Recommendations group at Wayfair and in the Data Science group at Millennial Media (now part of AOL). When he is not busy crunching data, he likes to run.

Presentations

A data-driven approach to the US presidential election Session

Amir Hajian, Khaled Ammar, and Alex Constandache offer an approach to mining a large dataset to predict the electability of hypothetical candidates in the US presidential election race, using machine learning, natural language processing, and deep learning on an infrastructure that includes Spark and Elasticsearch, which serves as the backbone of the mobile game White House Run.

Sylvain Corlay is a quantitative researcher at QuantStack and an active contributor to the open source Project Jupyter. Sylvain also teaches finance at NYU and Columbia. He holds a PhD in quantitative finance from Université Pierre et Marie Curie.

Presentations

JupyterLab: The evolution of the Jupyter Notebook Session

Brian Granger, Sylvain Corlay, and Jason Grout offer an overview of JupyterLab, the next-generation user interface for Project Jupyter that puts Jupyter Notebooks within a powerful user interface that allows the building blocks of interactive computing to be assembled to support a wide range of interactive workflows used in data science.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Inbox is the Trojan horse of AI Keynote

When Hollywood portrays artificial intelligence, it's either a demon or a savior. But the reality is that AI is far more likely to be an extension of ourselves. Strata program chair Alistair Croll looks at the sometimes surprising ways that machine learning is insinuating itself into our everyday lives.

Thursday keynotes Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Sabrina Dahlgren is a director in charge of strategic analysis at Kaiser Permanente. Her expertise ranges from statistics and economics to project management and computer science. Sabrina has 20 years of experience in leadership and analytical roles, including vice president of marketing and product development and manager of CRM and customer segmentation, at technology companies including Vodafone. Sabrina has twice won the Innovation Award at Kaiser, most recently in the category of broadly applicable technology for big data analytics.

Presentations

Big data in healthcare Session

While other industries have embraced the digital era, healthcare is still playing catch-up. Kaiser Permanente has been a leader in healthcare technology and first started using computing to improve healthcare results in the 1960s. Taposh Roy, Rajiv Synghal, and Sabrina Dahlgren offer an overview of Kaiser’s big data strategy and explain how other organizations can adopt similar strategies.

Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine, Gobblin, a data lifecycle management platform for Hadoop, WhereHows, a data discovery and lineage platform, and Dali, a data virtualization layer for Hadoop.

Presentations

Architecting for change: LinkedIn's new data ecosystem Session

Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes, such as client-side activity tracking, a unified reporting platform, and data virtualization techniques to simplify migration, that enable LinkedIn to roll out future product innovations with minimal downstream impact.

Michael Dauber is a general partner at Amplify Partners. Previously, Mike spent over six years at Battery Ventures, where he led early-stage enterprise investments on the West Coast, including Battery’s investment in a stealth security company that is also in Amplify’s portfolio. Mike has served on the boards of a number of companies, including Continuuity, Duetto, Interana, and Platfora. Mike’s investments include Splunk and RelateIQ, which was recently acquired by Salesforce. Mike began his career as a hardware engineer at a startup and held product, business development, and sales roles at Altera and Xilinx. Mike is a frequent speaker at conferences and is on the advisory board of both the O’Reilly Strata Conference and SXSW. He was named to Forbes magazine’s 2015 Midas Brink List. Mike holds a BS in electrical engineering from the University of Michigan in Ann Arbor and an MBA from the University of Pennsylvania’s Wharton School.

Presentations

Where's the puck headed? Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road. Join us to hear about the trends that everyone is seeing and areas for investment that they find exciting.

Tom de Godoy is cofounder and CTO of DataRobot and has over 10 years of data science and engineering experience. Previously, he was senior director of research and modeling at Travelers Insurance, where he managed a team of data scientists working on applications in pricing, claims, and customer behavior for various insurance products. Tom has been ranked as high as 20th in the world on the data science competition platform Kaggle.com, which boasts more than 500,000 registered data scientists.

Presentations

Data science for executives Session

In today's world, executives need to be the drivers for data science solutions. Data analysis has moved from the domain of data scientists to the forefront of core strategic initiatives. Are you empowering your team to identify and execute on every opportunity to optimize business with machine learning? In this session, you will learn how executives are transforming business with machine learning.

Alexander Dean is cofounder and technical lead at Snowplow Analytics, an enterprise-strength open source event analytics platform.

Presentations

What Crimean War gunboats teach us about the need for schema registries Session

In 1853, Britain’s workshops built 90 new gunboats for the Royal Navy in just 90 days—an astonishing feat of engineering made possible by industrial standardization. Snowplow's Alexander Dean argues that data-sophisticated corporations need a new standardization of their own, in the form of schema registries like Confluent Schema Registry or Snowplow’s own Iglu.

Danielle Dean is a principal data scientist lead at Microsoft in the Algorithms and Data Science Group within the Artificial Intelligence and Research Division, where she leads a team of data scientists and engineers building predictive analytics and machine learning solutions with external companies utilizing Microsoft’s Cloud AI Platform. Previously, she was a data scientist at Nokia, where she produced business value and insights from big data through data mining and statistical modeling on data-driven projects that impacted a range of businesses, products, and initiatives. Danielle holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill, where she studied the application of multilevel event history models to understand the timing and processes leading to events between dyads within social networks.

Presentations

Breeding data scientists: A four-year study Session

At Strata + Hadoop World 2012, Amy O'Connor and her daughter Danielle Dean shared how they learned and built data science skills at Nokia. This year, Amy and Danielle explore how the landscape in the world of data science has changed in the past four years and explain how to be successful deriving value from data today.

Evaluating models for a needle in a haystack: Applications in predictive maintenance Session

In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data.
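
The evaluation pitfall the speakers address is easy to demonstrate: with a 1% failure rate, a model that never predicts a failure is 99% accurate and completely useless. The hypothetical numbers below (computed with scikit-learn) show why precision and recall on the rare class are the metrics that matter.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0] * 990 + [1] * 10                      # 1% of machines fail
y_naive = [0] * 1000                               # "nothing ever fails"
y_model = [0] * 985 + [1] * 5 + [0] * 4 + [1] * 6  # 5 false alarms, 6 catches

print(accuracy_score(y_true, y_naive))   # 0.99 -- looks great...
print(recall_score(y_true, y_naive))     # 0.0  -- ...catches no failures

print(accuracy_score(y_true, y_model))   # 0.991 -- barely "better"
print(precision_score(y_true, y_model))  # ~0.55 -- but it flags real failures
print(recall_score(y_true, y_model))     # 0.6
```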

Kaushik Deka is the director of engineering at Novantas. He has 15 years' experience in big data engineering in the enterprise and is an expert in Spark and Hadoop architectures. Kaushik holds master's degrees in computer science from the University of Missouri, technology management from the Wharton School, and financial engineering from Carnegie Mellon.

Presentations

How a Spark-based feature store can accelerate big data adoption in financial services Session

Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, a library of reusable features that allows data scientists to solve business problems across the enterprise. Kaushik and Phil outline three challenges they faced—semantic data integration within a data lake, high-performance feature engineering, and metadata governance—and explain how they overcame them.

Anthony Dina serves as the director of enterprise technologists at Dell, Inc., where he leads a team of solutions architects with expertise in big data and application acceleration, helping customers transform IT into better business outcomes. Anthony has 17 years of experience in the IT industry and has held a number of executive roles in strategy and solutions marketing. His successes include ramping the blades business to number one, launching the first Opteron server, and championing virtual IO solutions, all within 10 years. Anthony holds a master of business administration from the University of St. Thomas and a master of fine arts from Cranbrook Academy of Art, as well as certifications in ITIL v3 Foundation and services strategy.

Presentations

Enhancing the customer experience when driving Hadoop adoption Session

Mastercard's Nick Curcuru hosts an interactive fireside chat with Anthony Dina from Dell to explore how the flexibility, scalability, and agility of Hadoop big data solutions allow one of the world’s leading organizations to innovate, enable, and enhance the customer experience while still expanding emerging opportunities.

Renee DiResta is the vice president of business development at Haven, a private marketplace for booking ocean freight shipments. Previously, Renee was a principal at seed-stage VC fund O’Reilly AlphaTech Ventures (OATV) and spent seven years as a trader at Jane Street Capital, a quantitative proprietary trading firm in New York City. Renee is interested in improving liquidity and transparency in private markets and enjoys investing in and advising hardware startups.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Of market makers and middlemen: How data is transforming global trade Data Case Studies

Financial markets have evolved from small, local marketplaces to centralized, tech-mediated exchanges, allowing better price discovery, faster execution, and decreased risk. However, the global transportation industry has lagged. Renee DiResta explores how data and tech is transforming container shipping and discusses Haven's data-driven automated freight procurement model.

Jake Dolezal is a practice lead for McKnight Consulting Group Global Services. Jake has over 17 years of experience in information management, with expertise in business intelligence, analytics, data warehousing, statistics, data modeling and integration, data visualization, master data management, and data quality across a broad array of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and gaming. Previously, Jake was the senior director of information management at the Choctaw Nation of Oklahoma—the third-largest Native American tribe in the United States, with over 200,000 members worldwide—where he championed and developed an enterprise-wide information management initiative from the ground up across the organization’s commercial, government, healthcare, social service, and education divisions. He was also involved with the organization’s core CRM and ERP systems. Jake is the author of two books due to be published this year. He holds a PhD in information management from Syracuse University and is a certified business intelligence professional through TDWI with an emphasis in data analysis. He is also a certified leadership coach and has helped clients accelerate their careers and earn several executive promotions.

Presentations

Gaining extreme agility and performance using a Spark-free approach to data management Session

Jake Dolezal shares research into the performance of data quality and data management workloads on Hadoop clusters. Jake discusses a YARN-based approach to data management and outlines highly effective IT resource utilization techniques to achieve extreme agility for organizations and performance gains in Hadoop.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Many Hadoop clusters lack even basic security controls. Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You'll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Michael Dowd is a data scientist at DataKind working on a long-term project using data science to prevent traffic fatalities in cities nationwide. A recent graduate of MIT with master's degrees in city planning and transportation, Michael is a transportation planner, modeler, coder, GIS expert, and civic data enthusiast. He hails from Seattle, WA, but has lived most of his adult life in NYC. While attending MIT, he worked on modeling the potential impacts of climate change and inundation on the Boston metro region. He loves writing code, working with civic data, studying transportation, and all things geospatial.

Presentations

A collaboration in civic tech: Improving traffic safety nationwide Data Case Studies

The global movement Vision Zero aims to reduce traffic fatalities and severe injuries to zero. Erin Akred and Michael Dowd explore a partnership between Microsoft, a team of DataKind data scientists, government officials, and researchers that has been working to leverage newly available datasets to inform cities’ efforts nationwide to reduce traffic-related deaths and severe injuries to zero.

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Ted Dunning is chief applications architect at MapR Technologies. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects, and a mentor for various incubator projects. Ted contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library. He also designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date. He holds a PhD in computing science from the University of Sheffield. He is on Twitter as @ted_dunning.

Presentations

Fast cars, big data: How streaming data can help Formula 1 Session

Modern cars produce data. Lots of data. And Formula 1 cars produce more than their fair share. Ted Dunning presents a demo of how data streaming can be applied to the analytics problems posed by modern motorsports. Although he won't be bringing Formula 1 cars to the talk, Ted demonstrates a physics-based simulator to analyze realistic data from simulated cars.

Office Hour with Ted Dunning (MapR Technologies) Office Hours

Join Ted for discussion on data streaming—and how it can be applied to the analytics problems posed by modern motorsports. (Ask him about his physics-based simulator that analyzes data from simulated cars.)

Michael Eacrett is vice president of product management leading SAP’s in-memory distributed computing platform, SAP HANA Vora, big data and IoT platforms, and enterprise information management products, where he defines product strategy, manages product requirements and the partner ecosystem, and enables product go-to-market. Most recently, Michael built up and led the SAP HANA PM team developing the HANA product and business from initial launch to over $2 billion in product sales. Michael has over 20 years of industry experience in product management, strategic consulting, and implementation in both America and Europe.

Presentations

Big data governance: Making big data an enterprise-class citizen Session

Big data is a critical part of the enterprise data fabric and must meet the critical enterprise criteria of correctness, quality, consistency, compliance, and traceability. Michael Eacrett explains how companies are using big data infrastructures, asynchronously and in real time, to actively solve information governance and data-quality challenges.

Office Hour with Michael Eacrett (SAP) Office Hours

Michael Eacrett will be available to discuss how companies are using big data infrastructures, asynchronously and in real time, to actively solve information governance and data quality challenges.

Emil Eifrem is CEO of Neo Technology and cofounder of Neo4j, the world’s leading graph database. Committed to sustainable open source, he guides Neo along a balanced path between free availability and commercial reliability. Before founding Neo, Emil was the CTO of Windh AB, where he headed the development of highly complex information architectures for enterprise content management systems.

Presentations

Using graph databases to operationalize insights from big data Session

Tim Williamson and Emil Eifrem explain how organizations can use graph databases to operationalize insights from big data, drawing on the real-life example of Monsanto’s use of graph databases to conduct real-time graph analysis of the company’s data to transform the business in ways that were previously impossible.

Justin Erickson is a senior director of product management leading Cloudera's platform team, which is responsible for the components of Cloudera's Distribution including Apache Hadoop (CDH) above the storage layer. Previously, he led the high-availability and disaster-recovery areas of Microsoft SQL Server.

Presentations

BI and SQL analytics with Hadoop in the cloud Session

Henry Robinson and Justin Erickson explain how to best take advantage of the flexibility and cost-effectiveness of the cloud with your BI and SQL analytic workloads using Apache Hadoop and Apache Impala (incubating), covering the architectural considerations, best practices, tuning, and functionality available when deploying or migrating BI and SQL analytic workloads to the cloud.

Andy Eschbacher is a map scientist at CartoDB, where he focuses his efforts on experimenting with CartoDB's stack and trying to push it in new directions. Andy also runs CartoDB's Map Academy, an open source education project aiming to teach the skills required to be a successful web mapper through lessons on data visualization, external API integrations, PostGIS/SQL in the cloud, and much more.

Presentations

Designing a location intelligence platform for everyone by integrating data, analysis, and cartography Session

Geospatial analysis can provide deep insights into many datasets. Unfortunately, the key tools to unlocking these insights—geospatial statistics, machine learning, and meaningful cartography—remain inaccessible to nontechnical audiences. Stuart Lynn and Andy Eschbacher explore the design challenges in making these tools accessible and integrated in an intuitive location intelligence platform.

Susan Etlinger is an industry analyst at Altimeter. Her research focuses on the impact of artificial intelligence, data and advanced technologies on business and culture and is used in university curricula around the world. Susan’s TED talk, “What Do We Do With All This Big Data?,” has been translated into 25 languages and has been viewed more than 1.2 million times. She is a sought-after keynote speaker and has been quoted in such media outlets as the Wall Street Journal, the BBC, and the New York Times.

Presentations

Helping computers help us see Session

The history of the digital age is being written in photographs. To innovate in the visual age, we have to crack the visual code. Susan Etlinger explores why the ability to understand why one photo resonates and another doesn't can make or break reputations, spark new products or lines of business, and make or save millions of dollars.

Moty Fania owns development and architecture in the advanced analytics group within Intel IT. With over 13 years of experience in analytics, data warehousing, and decision support solutions, Moty drives the overall technology and architectural roadmap for big data analytics in Intel IT. Moty is also the architect behind Intel’s IoT big data analytics platform. He holds a bachelor’s degree in computer science and economics and a master’s degree in business administration from Ben-Gurion University in Israel.

Presentations

Stream analytics in the enterprise: A look at Intel’s internal IoT implementation Session

Moty Fania shares Intel’s IT experience implementing an on-premises IoT platform for internal use cases. The platform was designed as a multitenant platform with built-in analytical capabilities and based on open source big data technologies and containers. Moty highlights the lessons learned from this journey with a thorough review of the platform’s architecture.

Patricia Florissi is vice president and global chief technology officer for sales at Dell EMC, where she helps define mid- and long-term technology strategy, representing the needs of the broader EMC ecosystem in EMC strategic initiatives. Patricia also acts as the liaison between EMC and customers and partners to foster stronger alliances and deliver higher value to EMC clientele. She holds the honorary title of EMC distinguished engineer and is the creator, author, narrator, and graphical influencer of EMC Big Ideas, an educational video series on emerging technologies and trends that accelerates and expands technical thought leadership in EMC for both internal and external audiences, using innovative learning methodologies in a fun, easy way without talking about products. Previously at EMC, Patricia was the CTO for Ionix, where she was responsible for defining and communicating EMC's medium- to long-term vision for delivering solutions to automate the management of information infrastructure resources; the strategic initiative leader for governance, risk, and compliance (GRC), where she was responsible for leading the research, design, execution, and communication of EMC's GRC vision and strategy; Americas CTO for sales; and Americas and Europe, Middle East, and Africa (EMEA) CTO. Before joining EMC, Patricia was the vice president of advanced solutions at Smarts, where she was responsible for defining the strategy for bringing solutions to market to address the challenges introduced by emerging technologies. In that role, she led the research, design, and first release of over half a dozen products that have driven millions of dollars in revenue and remain in the market today.

Patricia has written articles on the impact of big data in accelerating innovation, including for the 2014 World Economic Forum (WEF) Global IT Report newsletters, and has published in periodicals including Computer Networks and IEEE Proceedings. She holds multiple patents. Patricia is a board member of the Columbia School of Engineering Board of Visitors and of the Brazilian/American Chamber of Commerce, where she assists in fostering relations with Brazil, and is the chair of the advisory board for the Data Science Graduate Program at Worcester Polytechnic Institute, where she advises and assists in the school's new program. She is also an active member and participant of the Americas Society/Council of the Americas organization. Patricia serves as a mentor for several groups both inside and outside of EMC and sits as a mentor and judge for the Boston-based MassChallenge group as well as the Boston Club, which advances women leaders. Patricia holds a PhD in computer science from Columbia University in New York, an MBA from the Stern School of Business at New York University, and master's and bachelor's degrees in computer science from the Universidade Federal de Pernambuco in Brazil.

Presentations

Modern analytics with Dell EMC Keynote

Data, your most precious commodity, is increasing at an alarming rate. At the same time, an emerging business imperative has made this data a component of your deepest insights, allowing you to focus on your business outcomes. Patricia Florissi explains why the recent formation of Dell EMC ensures that your analytics capabilities will be stronger than ever.

Jonathan Fritz is a senior product manager at Amazon EMR, a managed service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data using Hadoop, Spark, and Presto. Prior to Amazon, Jonathan was the founder and CEO of Eleven Media Group and performed research in organic chemistry and nanotechnology in the Maurer Group at Washington University in St. Louis. He holds an MBA from the Stanford Graduate School of Business and a bachelor's degree in chemistry with a minor in biology from Washington University in St. Louis. He received a certificate of accomplishment in entrepreneurship from the Skandalaris Center for Entrepreneurial Studies.

Presentations

Running Presto and Spark on AWS: From zero to insight in less than five minutes Session

Running Hadoop, Spark, and Presto can be as fast and inexpensive as ordering a latte at your favorite coffee shop. Jonathan Fritz explains how organizations are deploying these and other big data frameworks with Amazon Web Services (AWS) and how you too can quickly and securely run Spark and Presto on AWS. Jonathan shows you how to get started and shares best practices and common use cases.
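
For those curious what "zero to insight" looks like in practice, launching such a cluster is a single API call. Here is a hedged boto3 sketch; the release label, instance types, counts, and region are illustrative assumptions to check against current EMR documentation.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.0.0",           # a 2016-era release label (assumption)
    Applications=[{"Name": "Spark"}, {"Name": "Presto"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance role
    ServiceRole="EMR_DefaultRole",
)
print("Cluster starting:", response["JobFlowId"])
```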

Uma Maheswara Rao G is an Apache Software Foundation member, an Apache Hadoop committer, a member of the Apache Hadoop PMC, and a long-term active contributor to the Apache Hadoop project. He is also a PMC member for the Apache BookKeeper project. Uma is a senior software engineer at Intel, where he is responsible for Apache HDFS open source development.

Presentations

Debunking HDFS erasure coding performance myths Session

The new erasure coding feature in Apache Hadoop (HDFS-EC) reduces the storage cost by ~50% compared with 3x replication. Zhe Zhang and Uma Maheswara Rao G present the first-ever performance study of HDFS-EC and share insights on when and how to use the feature.
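
The ~50% figure follows directly from the storage math: triple replication writes three full copies of each block, while the Reed-Solomon (6, 3) layout used by HDFS-EC writes six data blocks plus three parity blocks. A quick back-of-the-envelope check:

```python
data_tb = 100                           # logical data, in TB

replication_raw = data_tb * 3           # 3x replication: 300 TB on disk
ec_raw = data_tb * (6 + 3) / 6          # RS-6-3: 1.5x overhead, 150 TB on disk

savings = 1 - ec_raw / replication_raw
print(f"{replication_raw} TB vs {ec_raw:.0f} TB -> {savings:.0%} less raw storage")
# 300 TB vs 150 TB -> 50% less raw storage
```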

Mariusz Gądarowski is an IT director at deepsense.io. Mariusz is a big data and machine-learning enthusiast with over 10 years of professional design and software development experience. He started his career as a software engineer developing storage and distributed computing solutions for big data. Soon, he became the CTO at Gemius, an Internet consulting company. Prior to joining deepsense.io, he managed teams working on cloud and networking systems at CodiLime. Mariusz holds master’s degrees in mathematics and computer science from the University of Warsaw.

Presentations

Neptune: A machine-learning platform for experiment management Session

Mariusz Gądarowski offers an overview of Neptune, deepsense.io's new machine-learning experiment management platform for data scientists. Neptune enhances the management of machine-learning tasks such as dependent computational processes, code versioning, comparing achieved results, monitoring tasks and progress, sharing infrastructure among teammates, and many others.

Shankar Ganapathy is vice president and chief revenue officer at Paxata, where he leads all customer success, field, sales operations, and go-to-market teams. Prior to Paxata, Shankar was with MicroStrategy, where, as an early employee, he was integral to building out a highly effective go-to-market organization. Most recently, he was senior vice president in charge of the company's Asia-Pacific and Japan operations, where he was instrumental in driving significant scale and growth in that business. Shankar also developed and grew MicroStrategy's channels business globally and held leadership positions in the company's professional services business.

Presentations

Citi, Standard Chartered Bank, and Polaris: The modern information pipeline that fuels investigations of money laundering, fraud, and human trafficking Session

Join data experts from Citi, Standard Chartered Bank, and Polaris for a panel discussion moderated by Shankar Ganapathy. Learn about the principles, technologies, and processes they have used to design a highly efficient information management pipeline architected around the Hadoop ecosystem.

François Garillot is a data scientist at Swisscom, where he works on curating and understanding telecommunications data through big data tools. Previously, François worked on Apache Spark Streaming’s reliability at Lightbend (formerly Typesafe). His interests include machine learning—especially online models, approximation and hashing techniques, control theory, and unsupervised time series analysis—skiing, sailing, and hunting for good cheese.

Presentations

Delivering near real-time mobility insights at Swisscom Session

Swisscom, the leading mobile service provider in Switzerland, also provides data-driven intelligence through the analysis of its mobile network. Its Mobility Insights team works to help administrators understand the flow of people through their location of interest. François Garillot explores the platform, tooling, and choices that help achieve this service and some challenges the team has faced.

Yael Garten is director of data science at LinkedIn, where she leads a team that focuses on understanding and increasing growth and engagement of LinkedIn’s 400 million members across mobile and desktop consumer products. Yael is an expert at converting data into actionable product and business insights that impact strategy. Her team partners with product, engineering, design, and marketing to optimize the LinkedIn user experience, creating powerful data-driven products to help LinkedIn’s members be productive and successful. Yael champions data quality at LinkedIn; she has devised organizational best practices for data quality and developed internal data tools to democratize data within the company. Yael also advises companies on informatics methodologies to transform high-throughput data into insights and is a frequent conference speaker. She holds a PhD in biomedical informatics from the Stanford University School of Medicine, where her research focused on information extraction via natural language processing to understand how human genetic variations impact drug response, and an MSc from the Weizmann Institute of Science in Israel.

Presentations

Architecting for change: LinkedIn's new data ecosystem Session

Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes, such as client-side activity tracking, a unified reporting platform, and data virtualization techniques to simplify migration, that enable LinkedIn to roll out future product innovations with minimal downstream impact.

Shaheen Gauher is a data scientist in information management and machine learning at Microsoft, where she develops end-to-end, data-driven advanced analytics solutions for external customers. She is passionate about data and science and uses machine learning to come up with key insights that generate value for better decisions and better business performance. A climate scientist by training, Shaheen received her PhD in earth, ocean, and atmospheric sciences with a focus on satellite retrievals.

Presentations

Evaluating models for a needle in a haystack: Applications in predictive maintenance Session

In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data.

Bas Geerdink is a programmer, scientist, and IT manager at ING, where he is responsible for the fast data systems that process and analyze streaming data. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Hadoop and Spark at ING: An overview of the architecture, security, and business cases at a large international bank Session

Bas Geerdink offers an overview of the evolution that the Hadoop ecosystem has taken at ING. Since 2013, ING has invested heavily in a central data lake and data management practice. Bas shares historical lessons and best practices for enterprises that are incorporating Hadoop into their infrastructure landscape.

Colette Glaeser is a principal data strategist at Silicon Valley Data Science. With a proven track record in applying analytics to provide a competitive advantage, Colette brings over 20 years of experience in driving business development, customer insight, operational analysis, and continuous process improvement across a range of industries. She draws on a deep understanding of the business questions that arise in support of profitability and growth initiatives, as well as an arsenal of analytic tools, methods, and technologies, to translate data into actionable insights.

Presentations

Developing a modern enterprise data strategy Tutorial

How do you reconcile the business opportunity of big data and data science with the sea of possible technologies? Fundamentally, data should serve the strategic imperatives of a business—those key aspirations that define an organization’s future vision. Edd Wilder-James and Colette Glaeser explain how to create a modern data strategy that powers data-driven business.

Scott Gnau is the CTO of Hortonworks, a company at the forefront of emerging connected data platforms, where he works intimately with leaders in the Fortune 1000 undergoing business transformation through real-time data. Scott has spent his entire career in the data industry; previously, he was president of Teradata Labs, where he provided visionary direction for research, development, and sales support activities related to Teradata integrated data warehousing, big data analytics, and associated solutions. He also drove the investments and acquisitions in Teradata’s technology related to the solutions from Teradata Labs. Scott holds a BSEE from Drexel University.

Presentations

Powering the future of data with connected data platforms Session

Scott Gnau provides unique insights into the tipping point for data, how enterprises are now rethinking everything from their IT architecture and software strategies to data governance and security, and the cultural shifts CIOs must grapple with when supporting a business using real-time data to scale and grow.

Joe Goldberg is the lead solutions marketing manager at BMC Software, where he helps BMC products leverage new technology to deliver market-leading solutions with a focus on workload automation and big data. Joe has more than 35 years of experience in the design, development, implementation, sales, and marketing of enterprise solutions to Global 2000 organizations.

Presentations

Accelerate EDW modernization with the Hadoop ecosystem Session

Joe Goldberg explores how companies like GoPro, Produban, Navistar, and others have taken a platform approach to managing their workflows; how they are using workflows to power data ingest, ETL, and data integration processing; how an end-to-end view of workflows has reduced issue resolution time; and how these companies are achieving success in their data warehouse modernization projects.

Brett Goldstein is a leader in enterprise architecture, big data analytics, and government technology with 15 years of experience in operations, management, and leadership in technical environments in both the public and private sector. Brett was recently named the inaugural recipient of the Fellowship in Urban Science at the University of Chicago’s Harris School of Public Policy. As a senior fellow in urban science, he will focus on issues of computation and public policy to inform better decision making in government. Previously, Brett was the commissioner and chief information officer of the Chicago Department of Innovation and Technology (DoIT), appointed by Mayor Rahm Emanuel to accelerate Chicago’s growth as a global hub of innovation and technology. During his tenure as Chicago’s CIO, Brett successfully worked toward a comprehensive consolidation of technology while rapidly accelerating the role of innovation in government. His achievements have included changing Chicago’s technology strategy to include cloud environments, and reshaping the IT portfolio to include advanced analytics with a focus on urban prediction. Brett was also the chief data officer for the City of Chicago, the first position of this kind for a major municipality, where he led the city’s data strategy to help improve the way the city’s information works for its residents.

Before coming to City Hall, Brett was one of the youngest commanders in the Chicago Police Department, where he founded and directed the department’s Predictive Analytics Group, which aimed to predict violent crime patterns. Previously, Brett was an early employee with OpenTable, where he played an integral role in scaling the operation from a handful of restaurants in San Francisco to a network that operates worldwide. He holds a bachelor’s degree from Connecticut College, an MS in criminal justice from Suffolk University, and an MS in computer science from University of Chicago. Brett is pursuing his PhD in criminology, law, and justice at the University of Illinois-Chicago. He resides in Chicago with his wife and three children.

Presentations

Thinking outside the black box: The imperative for accountability and transparency in predictive analytics Session

How can we usher in a future of data-driven decision making that is characterized by more—not less—accountability and accessibility? Brett Goldstein discusses the imperative to couple new developments in data science with a renewed commitment to transparency and open source—with a particular focus on open source models to optimize deployment of policing resources.

Josh Gordon works as a developer advocate on TensorFlow at Google. He’s passionate about machine learning and computer science education. In his free time, Josh loves biking, running, and exploring the great outdoors.

Presentations

Ask me anything: Deep learning with TensorFlow AMA

Martin Wicke and Josh Gordon field questions related to their tutorial, Deep Learning with TensorFlow.

Deep learning with TensorFlow Tutorial

Martin Wicke and Josh Gordon offer hands-on experience training and deploying a machine-learning system using TensorFlow, a popular open source library. You'll learn how to build machine-learning systems from simple classifiers to complex image-based models as well as how to deploy models in production using TensorFlow Serving.
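
As a taste of what "simple classifiers" means here, below is a minimal softmax classifier in the TensorFlow 1.x-era API that was current at the time. The synthetic data, shapes, and hyperparameters are illustrative assumptions, not the tutorial's actual exercises.

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])  # 4 input features
y = tf.placeholder(tf.int64, [None])       # integer class labels

W = tf.Variable(tf.zeros([4, 3]))          # weights for 3 output classes
b = tf.Variable(tf.zeros([3]))
logits = tf.matmul(x, W) + b

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Random stand-in data; a real run would feed actual features and labels.
data = np.random.rand(32, 4).astype(np.float32)
labels = np.random.randint(0, 3, size=32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        _, current_loss = sess.run([train_op, loss],
                                   feed_dict={x: data, y: labels})
    print("final loss:", current_loss)
```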

Brian Granger is an associate professor of physics and data science at Cal Poly State University in San Luis Obispo. Brian is a leader of the IPython project, cofounder of Project Jupyter, and an active contributor to a number of other open source projects focused on data science in Python. Recently, he cocreated the Altair package for statistical visualization in Python. He is an advisory board member of NumFOCUS and a faculty fellow of the Cal Poly Center for Innovation and Entrepreneurship.

Presentations

JupyterLab: The evolution of the Jupyter Notebook Session

Brian Granger, Sylvain Corlay, and Jason Grout offer an overview of JupyterLab, the next-generation user interface for Project Jupyter that puts Jupyter Notebooks within a powerful user interface that allows the building blocks of interactive computing to be assembled to support a wide range of interactive workflows used in data science.

Jonathan Gray is the founder and CEO of Cask. Jonathan is an entrepreneur and software engineer with a background in startups, open source, and all things data. Previously, he was a software engineer at Facebook, where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production. An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase. He is now a core contributor and active committer in the community. Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.

Presentations

Unified integration for data lakes and modern data applications Session

Building, running, and governing a data lake on Hadoop is often a difficult process filled with slow development cycles and painful operations. Jonathan Gray proposes a modern, unified integration architecture that helps IT mitigate these issues while enabling businesses to reduce time to insights and make decisions faster through a modern self-service environment.

Garrett Grolemund is a data scientist and chief instructor for RStudio, Inc. Garrett is a longtime user and advocate of R; he wrote the popular lubridate package for working with dates and times in R. Garrett designed and delivered the highly rated O’Reilly video series Introduction to Data Science with R and is the author of Hands-On Programming with R and the coauthor, with Hadley Wickham, of R for Data Science. He holds a PhD in statistics and specializes in teaching others how to do data science with open source tools.

Presentations

R for big data Tutorial

Garrett Grolemund and Nathan Stephens explore the new sparklyr package by RStudio, which provides a familiar interface between the R language and Apache Spark and communicates with the Spark SQL and the Spark ML APIs so R users can easily manipulate and analyze data at scale.

Jason Grout is a Jupyter developer at Bloomberg, working primarily on JupyterLab and the interactive Jupyter widgets library. He has also been a major contributor to the open source Sage mathematical software system and co-organizes the PyDataNYC Meetup. Previously, Jason was an assistant professor of mathematics at Drake University in Des Moines, Iowa. He holds a PhD in mathematics from Brigham Young University.

Presentations

JupyterLab: The evolution of the Jupyter Notebook Session

Brian Granger, Sylvain Corlay, and Jason Grout offer an overview of JupyterLab, the next-generation user interface for Project Jupyter that puts Jupyter Notebooks within a powerful user interface that allows the building blocks of interactive computing to be assembled to support a wide range of interactive workflows used in data science.

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Ask me anything: Hadoop application architectures AMA

Mark Grover, Jonathan Seidman, and Ted Malaska, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Hadoop application architectures: Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing Tutorial

Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska demonstrate how to architect a modern, real-time big data platform and explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics such as real-time ETL, change data capture, and machine learning.

Top five mistakes when writing Spark applications Session

Ted Malaska and Mark Grover cover the top five things that prevent Spark developers from getting the most out of their Spark clusters. When these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters and the same data, using just a different approach.
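
One mistake in this family is well known enough to sketch without spoiling the speakers' list: shuffling every raw value with groupByKey when reduceByKey would combine values map-side first. The data below are made up for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="groupByKey-vs-reduceByKey")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)] * 100000)

# Anti-pattern: every individual value crosses the network before summing.
slow = pairs.groupByKey().mapValues(sum)

# Better: partial sums are combined on each map task, shrinking the shuffle.
fast = pairs.reduceByKey(lambda x, y: x + y)

print(fast.collect())  # [('a', 400000), ('b', 200000)]
```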

Jack Gudenkauf is a senior architect with the HPE Platform Services team. Jack has 30 years of experience building strong engineering teams at companies ranging from startups to Fortune 500 corporations, along with prior hands-on experience as a vice president of big data, product unit manager, architect, and developer designing and implementing Internet-scale distributed systems.

Presentations

A new “Sparkitecture” for modernizing your data warehouse Session

Jack Gudenkauf explores how organizations have successfully deployed tiered hyperscale architecture for real-time streaming with Spark, Kafka, Hadoop, and Vertica and discusses how advancements in hardware technologies such as nonvolatile memory, SSDs, and accelerators are changing the role of big data and big analytics platforms in an overall enterprise-data-platform strategy.

Carlos Guestrin is the director of machine learning at Apple and the Amazon Professor of Machine Learning in Computer Science and Engineering at the University of Washington. Carlos was the cofounder and CEO of Turi (formerly Dato and GraphLab), a machine-learning company acquired by Apple. A world-recognized leader in the field of machine learning, Carlos was named one of the 2008 Brilliant 10 by Popular Science. He received the 2009 IJCAI Computers and Thought Award for his contributions to artificial intelligence and a Presidential Early Career Award for Scientists and Engineers (PECASE).

Presentations

Why should I trust you? Explaining the predictions of machine-learning models Session

Despite widespread adoption, machine-learning models remain mostly black boxes, making it very difficult to understand the reasons behind a prediction. Such understanding is fundamentally important to assess trust in a model before we take actions based on a prediction or choose to deploy a new ML service. Carlos Guestrin offers a general approach for explaining predictions made by any ML model.

Sarah Guo joined Greylock in 2013. Previously, Sarah was at Goldman Sachs, where she invested in Dropbox, helped take Workday public, and advised pre-IPO private technology companies (as well as public clients including Zynga, Netflix, and Nvidia) on strategic and financial issues. Earlier in her career, Sarah worked with Casa Systems, a venture-funded startup that enables cable operators to meet growing demand for broadband services. She is an advocate for STEM education for women and the underserved, as well as education more generally, has taught marketing in the Wharton undergraduate program, and served as a teaching fellow in lower-income high schools for the Philadelphia World Affairs Council. She is also interested in drones, the connected home, wearables, robotics, 3-D printing, and software innovations in the healthcare and financial industries. Sarah holds four degrees from the Wharton School and the University of Pennsylvania and is a Lauder Institute fellow and a graduate of the Huntsman Program.

Presentations

Where's the puck headed? Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road. Join us to hear about the trends that everyone is seeing and areas for investment that they find exciting.

Himanshu Gupta is a software engineer at Yahoo and a Druid project committer. Himanshu has been working with Hadoop-based data pipelines and related platforms for the past few years and currently focuses on use of Druid inside Yahoo. Outside of work, Himanshu has written a video game for mobile, published solutions to pretty much all the exercises in How to Prove It, and dabbled in AI and ML algorithms. He’s a computer science autodidact and holds an MS degree in physics from the Indian Institute of Technology, Kanpur.

Presentations

Beyond Hadoop at Yahoo: Interactive analytics with Druid Session

Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.

Amir Hajian is a data scientist at Thomson Reuters. Previously, he was a senior research associate at the Canadian Institute for Theoretical Astrophysics in Toronto and a research physicist at Princeton University. Amir has a passion for data science, developing and applying new algorithms for data analysis using (Bayesian) statistics, machine learning, visualization, and big data technology. He holds a PhD in astrophysics.

Presentations

A data-driven approach to the US presidential election Session

Amir Hajian, Khaled Ammar, and Alex Constandache offer an approach to mining a large dataset to predict the electability of hypothetical candidates in the US presidential election race, using machine learning, natural language processing, and deep learning on an infrastructure that includes Spark and Elasticsearch, which serves as the backbone of the mobile game White House Run.

Martin Hall is the senior director of analytics platforms at Intel. Martin has spent the past 10 years at the heart of the analytics and big data marketplace; before joining Intel, he was at Karmasphere and FICO, where he worked closely within the ecosystem of innovative vendors and individuals to deliver analytics-based value to customers and individuals across a wide range of markets and solutions. Martin is based in the San Francisco Bay Area. His professional pursuits overlap with his personal activities in triathlons, rocketry, and photography.

Presentations

Collaboration and openness drive innovation in artificial intelligence Keynote

The power of artificial intelligence and advanced analytics emerges from the ability to analyze and compute large, disparate datasets from varied devices and locations at lightning-fast speed, powering applications such as predictive medicine and automated cars. Martin Hall explains why collaboration and openness are the key elements driving innovation in AI.

Eui-Hong (Sam) Han is the director of big data and personalization at the Washington Post. Sam is an experienced practitioner of data mining and machine learning and has an in-depth understanding of analytics technologies. He has successfully applied these technologies to solve real business problems. At the Washington Post, he leads a team building an integrated big data platform to store all aspects of customer profiles and activities from both digital and print circulation, content metadata, and business data. His team is building infrastructure, tools, and services to provide a personalized experience to customers, empower the newsroom with data for better decisions, and provide targeted advertising capability. Previously, he led the Big Data practice at Persistent Systems, started the Machine Learning Group in Sears Holdings’s online business unit, and worked for a data mining startup company. Sam’s expertise includes data mining, machine learning, information retrieval, and high-performance computing. He holds a PhD in computer science from the University of Minnesota.

Presentations

How the Washington Post uses machine learning to predict article popularity Session

Predicting which stories will become popular is an invaluable tool for newsrooms. Eui-Hong Han and Shuguang Wang explain how the Washington Post predicts what stories on its site will be popular with readers and share the challenges they faced in developing the tool and metrics on how they refined the tool to increase accuracy.

Hao Hao is a software engineer at Cloudera currently working on the Apache Sentry project, a granular, role-based authorization module for the Hadoop cluster. She is also a PMC member of the Apache Sentry (TLP) project. Hao performed extensive research on smartphone security and web security while she was a PhD student at Syracuse University. Prior to joining Cloudera, Hao worked on eBay’s Search Backend team building search infrastructure for eBay’s online buying platform.

Presentations

Authorization in the cloud: Enforcing access control across compute engines Session

Li Li and Hao Hao detail the architecture of Apache Sentry + RecordService for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access control, and encourage attendees to adopt Apache Sentry and RecordService to protect sensitive data on the multitenant cloud across the Hadoop ecosystem.

Yaron Haviv is the CTO and founder of iguaz.io as well as a serial entrepreneur with deep technological experience in the fields of big data, cloud, storage, and networking. Previously, Yaron was the vice president of datacenter solutions at Mellanox, where he led technology innovation, software development, and solution integrations and was the key driver of open source initiatives and new solutions with leading database and storage vendors, enterprise organizations, and cloud and Web 2.0 customers. Before Mellanox, Yaron was the CTO and vice president of R&D at Voltaire, a high-performance computing, I/O, and networking company. Yaron is a thought leader who often speaks at big data and cloud technology events and writes a popular technology blog.

Presentations

How to achieve zero-latency IoT and FSI data processing with Spark Session

Yaron Haviv explains how to design real-time IoT and FSI applications, leveraging Spark with advanced data frame acceleration. Yaron then presents a detailed, practical use case, diving deep into the architectural paradigm shift that makes the powerful processing of millions of events both efficient and simple to program.

Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic net regularization in Spark’s ML library and one-pass elastic net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.

Presentations

Spark Structured Streaming for machine learning Session

Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark's new Structured Streaming and walk you through creating your own streaming model.
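
To make the starting point concrete, here is a minimal PySpark sketch of the kind of Structured Streaming query such a model would consume; the input directory and schema are hypothetical stand-ins for a real source.

    # A minimal Structured Streaming sketch; the input directory and schema
    # are hypothetical placeholders for a real streaming source.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Streaming sources require an explicit schema.
    schema = StructType([
        StructField("user", StringType()),
        StructField("score", DoubleType()),
    ])

    events = spark.readStream.schema(schema).json("/data/incoming")

    # A continuously updated aggregate that downstream ML code could consume.
    counts = events.groupBy("user").count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()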

Brendan Herger is a data scientist at Capital One, working to understand how the company can leverage its data to empower its customers.

Presentations

Machine-learning techniques for class imbalances and adversaries Session

Many areas of applied machine learning require models optimized for rare occurrences, such as class imbalances, and users actively attempting to subvert the system (adversaries). Brendan Herger offers an overview of multiple published techniques that specifically attempt to address these issues and discusses lessons learned by the Data Innovation Lab at Capital One.

Brian Hopkins is a principal analyst at Forrester Research, where he covers emerging technology, technology innovation, data management, and big data for enterprise architecture professionals. Brian’s research provides practical advice to architects and IT strategists seeking to leverage emerging technology, improve their technology innovation practices, and evolve their data management capabilities.

Presentations

The insight-driven business Session

Uber, Netflix, LinkedIn, Tesla, Stitch Fix, Earnest—the list of digital disruptors using data to steal customers grows every month. But is it just that these firms are data driven? Is it because they have smart data scientists and Hadoop? The secret to their success is that these firms go further in order to be insight driven. Brian Hopkins explains what they're doing and how to join them.

Juliet Hougland answers complex business problems using statistics to tame multiterabyte datasets. She succeeds in applying and explaining the results of mathematical models across a variety of industries, including software, industrial energy, retail, and consumer packaged goods. Juliet is currently the head of data science for engineering at Cloudera, where she focuses on using data to help engineering build high-quality products. Cloudera’s customers have sought Juliet out as a field-facing data scientist who advises on which tools to use, teaches how to use them, recommends the best approach for bringing together the right data to answer the business problem at hand, and builds production machine-learning models. For many years, Juliet has contributed to open source projects such as Apache Spark, Scalding, and Kiji. Juliet holds an MS in applied mathematics from the University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in math-physics.

Presentations

Guerrilla guide to Python and Apache Hadoop Tutorial

Juliet Hougland and Sean Owen offer a practical overview of the basics of using Python data tools with a Hadoop cluster, covering HDFS connectivity and dealing with raw data files, running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating), and using Apache Spark to write more complex analytical jobs.
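
As a hedged preview of those three building blocks, the sketch below assumes a reachable cluster; the hostnames, ports, table, and paths are hypothetical.

    # HDFS access via the hdfs (WebHDFS) client, SQL via the impyla client,
    # and a more complex job via PySpark. All endpoints are hypothetical.
    from hdfs import InsecureClient
    from impala.dbapi import connect
    from pyspark.sql import SparkSession

    # 1. Read a raw file out of HDFS.
    hdfs_client = InsecureClient("http://namenode.example.com:50070")
    with hdfs_client.read("/data/raw/sample.csv") as reader:
        raw_bytes = reader.read()

    # 2. Run a SQL query against a SQL-on-Hadoop engine.
    cursor = connect(host="impalad.example.com", port=21050).cursor()
    cursor.execute("SELECT count(*) FROM web_logs")
    print(cursor.fetchall())

    # 3. Express a more complex analytical job with Spark.
    spark = SparkSession.builder.appName("guerrilla-sketch").getOrCreate()
    df = spark.read.csv("hdfs:///data/raw/sample.csv", header=True)
    df.groupBy("status").count().show()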

Juan Huerta is the head of decision sciences at Goldman Sachs’s new Consumer Lending Group, where he is responsible for leading the development of statistical models and algorithms that will support the Goldman Sachs Consumer Lending business, including models for targeting, customer segmentation, loan decisioning, conversion attribution, and fraud detection. Previously, Juan worked at organizations including IBM Watson Research, PlaceIQ, Citibank, Dow Jones, and Dragon Systems Inc., developing machine-learning algorithms to help computers acquire, decode, and understand human intention from raw signals. He holds a PhD in electrical and computer engineering from Carnegie Mellon University.

Presentations

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Upcoming challenges and opportunities for data technologies in consumer finance FinData

The release of Hadoop fundamentally changed the ability of financial enterprises to address velocity, variety, and volume in data. Ten years later, Juan Huerta describes the most significant data-oriented technical challenges the industry currently faces and the promising confluence of technologies and modeling paradigms that will drive the evolution of data technologies during the next decade.

John Hugg has spent his entire career working with databases and information management. In 2008, John was lured away from a PhD program by Mike Stonebraker to work on what became VoltDB. As the first engineer on the product, he liaised with a team of academics at MIT, Yale, and Brown who were building H-Store, VoltDB’s research prototype. Then John helped build the world-class engineering team at VoltDB to continue development of the open source and commercial products.

Presentations

VoltDB and the Jepsen test: What we learned about data accuracy and consistency Session

VoltDB promises full ACID with strong serializability in a fault-tolerant, distributed SQL platform, as well as higher throughput than other systems that promise much less. But why should users believe this? John Hugg discusses VoltDB's internal testing and support processes, its work with Kyle Kingsbury on the VoltDB Jepsen testing project, and where VoltDB will continue to improve.

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions of Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Tackling machine-learning complexity for data curation Session

Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.

Matthew Jacobs is a software engineer at Cloudera working on Impala.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Carey James is the director of business development for big data solutions in the EMC Global Solutions organization, where he works with customers to identify and develop strategies that take advantage of the key value propositions of big/fast data and analytics. Carey is a telecommunications industry leader with proven expertise in developing solutions that bridge the gap between the business and technical communities to meet customer needs. He focuses on how businesses can use analytics to create value and how IT can provide the appropriate analytics experience, in areas including data management, analytics, data integration, and data visualization.

Presentations

Achieve richer insights and business outcomes with Dell EMC big data and analytics Session

Big data and analytics is a team sport empowering companies of all kinds to achieve business outcomes faster and with greater levels of success. Carey James explains how the formation of Dell Technologies and Dell EMC can help you on your data analytics journey and how you can turn actionable insights into new business opportunities.

Phil Jarymiszyn is the director of big data integration services at Novantas. Phil has over 28 years of experience building enterprise/application data stores for banks and brokers. He has banking data domain expertise in all categories of bank operational systems and data requirements expertise in both analytical and operational use cases and is a BI expert for analytical and data democratization initiatives. Phil holds a BA in economics from Harvard University.

Presentations

How a Spark-based feature store can accelerate big data adoption in financial services Session

Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, a library of reusable features that allows data scientists to solve business problems across the enterprise. Kaushik and Phil outline three challenges they faced—semantic data integration within a data lake, high-performance feature engineering, and metadata governance—and explain how they overcame them.

Nandu Jayakumar is a software architect and engineering leader at Visa, where he is currently responsible for the long-term architecture of data systems and leads the data platform development organization. Previously, as a senior leader of Yahoo’s well-regarded data team, Nandu built key pieces of Yahoo’s data processing tools and platforms over several iterations, which were used to improve user engagement on Yahoo websites and mobile apps. He also designed large-scale advertising systems and contributed code to Shark (SQL on Spark) during his time there. Nandu holds a bachelor’s degree in electronics engineering from Bangalore University and a master’s degree in computer science from Stanford University, where he focused on databases and distributed systems.

Presentations

Swipe, dip, and hover: Managing card payment data at Visa Session

Visa, the world’s largest electronic payments network, is transforming the way it manages data: database appliances are giving way to Hadoop and HBase; proprietary ETL technologies are being replaced by Spark; and enterprise warehouse data models will be complemented by flexible data schemas. Nandu Jayakumar explores the adoption of big data practices at a conservative, financial enterprise.

Jolene Jeffries is director of software and services strategic operations at GE Oil & Gas, where she is a key driving force behind the value of the Industrial Internet. Jolene has been the executive sponsor of several IoT and analytics initiatives at GE Oil & Gas, driving asset and process optimization to increase field service profitability. Throughout her 18-year career at GE, she has held several key finance roles and is Six Sigma Black Belt certified.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Driving field service profitability with advanced analytics Data Case Studies

GE Oil & Gas is at the forefront of leveraging the Industrial Internet and advanced analytics to drive profitability and growth. Jolene Jeffries and Tara Prakriya explain how subject-matter experts are using advanced analytics and machine learning to directly contribute to the profitability of the business unit.

Chad W. Jennings is a product manager for BigQuery at Google Cloud. Chad came to Google from the startup world. He is an avid skier and surfer. When he’s not working on big things or playing in nature, he’s at home with his wife and two young children. Chad holds a PhD in aeronautics and astronautics from Stanford University.

Presentations

BigQuery for data warehousing

BigQuery provides petabyte-scale data warehousing with consistently high performance for all users. However, users coming from traditional enterprise data warehousing platforms often have questions about how best to adapt their workloads for BigQuery. Chad Jennings explores best practices and integration with BigQuery with special emphasis on loading and transforming data for BigQuery.
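
By way of illustration, here is a hedged sketch of loading and then transforming data with the google-cloud-bigquery Python client; the bucket, dataset, and table names are hypothetical, and load-then-transform is one common ordering rather than necessarily the presenter's.

    # Bulk-load raw JSON files from Cloud Storage, then reshape with SQL
    # inside BigQuery. All names below are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # let BigQuery infer the raw schema
    )
    load_job = client.load_table_from_uri(
        "gs://example-bucket/events/*.json",  # hypothetical bucket
        "analytics.raw_events",               # hypothetical dataset.table
        job_config=job_config,
    )
    load_job.result()  # block until the load completes

    # Transform after loading: reshape with SQL in place.
    client.query(
        "CREATE OR REPLACE TABLE analytics.daily_counts AS "
        "SELECT DATE(ts) AS day, COUNT(*) AS n "
        "FROM analytics.raw_events GROUP BY day"
    ).result()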

Google BigQuery for enterprise Keynote

Chad W. Jennings demonstrates the power of BigQuery through an exciting demo and announces several new features that will make BigQuery a better home for your enterprise big data workloads.

Brian Kahn is a senior science writer at Climate Central, a journalism and research nonprofit, and a lecturer at Columbia University’s Department of Earth and Environmental Sciences. Brian previously worked at the International Research Institute for Climate and Society producing multimedia stories, managing social media campaigns, and developing version 2.0 of Climate.gov. His writing has appeared in the Wall Street Journal, Grist, the Daily Kos, Justmeans, and the Yale Climate Connections and has been cited in the New York Times, Washington Post, Slate, and Vox. In previous lives, he led sleigh ride tours through a herd of 7,000 elk and guided tourists around the deepest lake in the US.

Presentations

Shifting cities: A case study in data visualization Session

Radish Lab teamed up with science news nonprofit Climate Central to transform temperature data from 1,001 US cities into a compelling, simple interactive that received more than 1 million views within three days of launch. Alana Range and Brian Kahn offer an overview of the process of creating a viral, interactive data visualization with teams that regularly produce powerful data stories.

Chris Kakkanatt is the director of business analytics and insights at Pfizer. Previously, he held several consulting positions at IBM. Chris holds an MBA in finance and value investing from Columbia Business School and a bachelor of science in computer science from Cornell University.

Presentations

Beyond the numbers: Expanding the size of your analytic discovery team Session

Analytic discovery is a team sport; the lone hero data scientist is a thing of the past. John Akred of Silicon Valley Data Science leads a panel of analytics and data experts from Pfizer, the City of San Diego, and Neustar that explores how these businesses were changed through analytic collaboration.

David Kale is a deep learning engineer at Skymind and a PhD candidate in computer science at the University of Southern California, where he is advised by Greg Ver Steeg of the USC Information Sciences Institute. His research uses machine learning to extract insights from digital data in high-impact domains, such as healthcare, and he collaborates with researchers from the Stanford Center for Biomedical Informatics Research and the YerevaNN Research Lab. Recently, David pioneered the application of deep learning to modern electronic health records data. At Skymind, he works with clients and partners to develop and deploy deep learning solutions for real-world problems. David co-organizes the Machine Learning for Healthcare Conference (MLHC) and has served as a judge in several XPRIZE competitions, including the upcoming IBM Watson AI XPRIZE. He is the recipient of the Alfred E. Mann Innovation in Engineering Fellowship.

Presentations

Conditional recurrent neural nets, generative AI Twitter bots, and DL4J Session

Can machines be creative? Josh Patterson and David Kale offer a practical demonstration—an interactive Twitter bot that users can ping to receive a response dynamically generated by a conditional recurrent neural net implemented using DL4J—that suggests the answer may be yes.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

The devil is in the details: Interactive, multiscale visualization of data lineage Session

Traditional ways of visualizing data lineage provide static mappings of source datasets to various targets or outputs. As the breadth of analysis occurring in schema-on-read environments increases, tracking how elements of the data were derived is critical. Sean Kandel introduces a new way to visualize data lineage that allows stakeholders a transparent view into their data.

Hanna Kang-Brown is an artist and designer in New York City and a former journalist, born and raised in Los Angeles. Hanna’s art explores the politics of memory, race, and ancestry. Hanna uses taste to represent data, and her projects have recently been shown in the Netherlands, Switzerland, and New York. Hanna earned a master’s degree from NYU Tisch School of the Arts’ Interactive Telecommunications Program and works as a senior experience designer at R/GA.

Presentations

Five-senses data: Using your senses to improve data signal and value Session

Data should be something you can see, feel, hear, taste, and touch. Drawing on real-world examples, Cameron Turner, Brad Sarsfield, Hanna Kang-Brown, and Evan Macmillan cover the emerging field of sensory data visualization, including data sonification, and explain where it's headed in the future.

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. At narrativeVIZ Consulting, Amit uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Amit also teaches storytelling with data as guest faculty in executive courses at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 12 years of management consulting experience with AT Kearney in India, Booz & Company in Europe, and more recently with startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT Delhi and a PGDM (MBA) from IIM Ahmedabad. Find out more about him at Amitkaps.com.

Presentations

Model visualization Session

Though visualization is used in data science to understand the shape of the data, it's not widely used for statistical models, which are evaluated based on numerical summaries. Amit Kapoor explores model visualization, which aids in understanding the shape of the model, the impact of parameters and input data on the model, the fit of the model, and where it can be improved.

Reiner Kappenberger is a global product manager at HPE Security–Data Security. Reiner has over 20 years of computer software industry experience focusing on encryption and security for big data environments. His background ranges from device management in the telecommunications sector to GIS and database systems. Reiner holds a diploma in computer science from the Regensburg University of Applied Sciences in Germany.

Presentations

Trusted IoT and big data ecosystems Session

Reiner Kappenberger explores the new standards and innovations enabling architects and developers to take a “build it in” approach to security in early design phases for big data and IoT systems, explaining why emerging technologies such as format-preserving encryption are rapidly delivering more trusted big data and IoT ecosystems without altering application behavior or device functionality.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.

Presentations

Spark Structured Streaming for machine learning Session

Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark's new Structured Streaming and walk you through creating your own streaming model.

Nitin Kaul works as an innovation lead on the Applied Technology team at Merck, where he manages projects to evaluate and apply innovative new information technologies to unmet needs and interacts with external collaborators, academic research groups, technology vendors, and commercial startups. He also serves as a project/solutions manager, solution architect, and technical lead for innovation proof-of-concept projects and oversees knowledge capture and transfer of new technology into Merck for internal engineering groups, in collaboration with internal and external SMEs. Nitin’s areas of focus include the Hadoop ecosystem, data science initiatives, open compute, distributed scale-out storage, OpenStack, microservices architecture, distributed application development, the health IT platform, and identity access management and entitlements. Nitin holds a master’s degree in computer science and a bachelor’s degree in electronics and system engineering.

Presentations

Cold chain analytics: Using Revolution R and the Hadoop ecosystem Data Case Studies

Nitin Kaul and Richard Baumgartner demonstrate how Merck applies descriptive, predictive, and prescriptive analytics leveraging parallel distributed libraries and the predictive modeling capabilities of Revolution R deployed on a secure Hadoop cluster to identify the various factors for product temperature excursions and predict and prevent future temperature excursions in product shipments.

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Mubashir Kazia is a solutions architect at Cloudera focusing on security. Mubashir started the initiative to integrate Cloudera Manager with Active Directory for kerberizing clusters and provided sample code. Mubashir has also contributed patches to Apache Hive that fixed security-related issues.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Many Hadoop clusters lack even basic security controls. Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You'll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Some inventions are the result of purposeful problem solving—like sippy cups to prevent toddlers from spilling juice or rolling luggage to make travel easier. Some are accidental, discoveries made by people working on something else entirely; for example, while testing heart medications, scientists noted side effects of one in particular, and the drug Viagra was born. Award-winning writer Pagan Kennedy explores the science of human imagination as it pertains to innovation and creativity. She unearths commonalities that predict the success of inventors, and theorizes that the skills required for “inventology” can be taught and learned.

In her latest book, Inventology: How We Dream Up Things That Change the World, "a delightful account of how inventors do what they do” (Kirkus Reviews), Pagan reveals the imaginative and practical processes behind groundbreaking innovations across numerous disciplines. From in-depth research and exhaustive interviews, she shows why successful inventors tend to be passionate, polymathic amateurs versus focused professionals working inside their fields. She explores whether serendipitous inspiration can be coaxed, suggests how to raise kids to be resourceful and inventive, and describes what factors beyond the “Aha!” moment are required for successful product development.

Pagan’s years of science reporting inform her takeaways on innovation, creativity, iconoclasts, and self-invention. Her 11 books include The First Man-Made Man, a study of early 20th-century transsexual Laura (formerly Michael) Dillon, whose desire to feel comfortable in her own skin drove experimentation and led to breakthrough medical technologies. Pagan’s journalism has appeared in dozens of publications, including the New York Times Magazine, where she wrote the Innovation/Who Made That? column. Pagan’s Head, her early ’zine, anticipated today’s highly personal, self-produced creative culture. She has also taught widely, including at Dartmouth College, Boston College, and Johns Hopkins University. As a Knight Science Journalism Fellow at MIT, Pagan studied microbiology and neuroengineering; she has won numerous other awards, including an NEA fellowship, a Smithsonian fellowship, and two Massachusetts Cultural Council fellowships.

Presentations

The art and science of serendipity Keynote

How do we discover what we're not looking for? In the age of big data and bioinformatics, the answer is more relevant than ever. We develop new tools to help us spot clues in mountains of information, and yet, serendipity remains a very human art. Pagan Kennedy discusses the origins of the word serendipity and qualities of mind that lead to successful searches in the deep unknown.

Paul Kent is vice president of big data initiatives at SAS, where he divides his time between customers, partners, and the Research & Development teams discussing, evangelizing, and developing software at the confluence of big data and high-performance computing. Paul was previously vice president of the Platform R&D division at SAS, where he led groups responsible for the SAS foundation and mid-tier technologies—teams that develop, maintain, and test Base SAS, as well as related data access, storage, management, presentation, connectivity, and middleware software products. Paul has contributed to the development of SAS software components including PROC SQL, TCP/IP connectivity, the Output Delivery System (ODS), and more recently the Inside-Database and High-Performance initiatives. A strong customer advocate, Paul is widely recognized within the SAS community for his active participation in the community and at local and international user conferences. Paul was educated at WITS in South Africa, graduating with a bachelor of commerce (with honors) followed by an almost complete MBA (interrupted to try a North American posting). He got his commercial introduction to using computers to make better business decisions in the gold division of Anglo American.

Presentations

SAS: More open than you might think Keynote

Hadoop and its ecosystem have changed analytics profoundly. Paul Kent offers an overview of SAS's participation in open platforms and introduces SAS Viya, a new unified and open analytics architecture that lets you scale analytics in the cloud and code as you choose.

Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.

Presentations

Triggers in Apache Beam (incubating) Session

Triggers specify when a stage of computation should emit output. With a small language of primitive conditions, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources. Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores real-world implementations in Google Cloud Dataflow.
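
To make the trigger vocabulary concrete, here is a small sketch using the Apache Beam Python SDK; the in-memory source is a stand-in for a real unbounded input.

    # Emit speculative early results on processing time, then a final
    # result when the watermark passes the end of each one-minute window.
    import apache_beam as beam
    from apache_beam.transforms import window
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterProcessingTime, AfterWatermark)

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([("user1", 5), ("user2", 3), ("user1", 2)])
            | beam.WindowInto(
                window.FixedWindows(60),  # one-minute event-time windows
                trigger=AfterWatermark(early=AfterProcessingTime(30)),
                accumulation_mode=AccumulationMode.ACCUMULATING)
            | beam.CombinePerKey(sum))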

Mike Koelemay runs the data science team within advanced analytics at Sikorsky, where he is responsible for bringing state-of-the-art analytics and algorithm technologies to support the ingestion, processing, and serving of data collected onboard thousands of aerospace assets around the world. Drawing on his 10+ years of experience in applied data analytics for integrated system health management technologies, Mike works with other software engineers, data architects, and data scientists to support the execution of advanced algorithms, data mining, signal processing, system optimization, and advanced diagnostics and prognostics technologies, with a focus on rapidly generating information from large, complex datasets.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Sikorsky's aircraft data platform: Turning data collected on-board thousands of aircraft into information for decision making Data Case Studies

Sikorsky collects data on-board thousands of helicopters deployed worldwide that is used for fleet management services, engineering analyses, and business intelligence. Mike Koelemay explores the data platform that Sikorsky has built to manage the ingestion, processing, and serving of this data so that it can be used to rapidly generate information to drive decision making.

Jaya Kolhatkar is vice president of engineering with @WalmartLabs, the Silicon Valley-based tech arm of Walmart’s ecommerce division, where she is responsible for defining and testing a predictive intelligence platform that allows data scientists to build and deploy predictive algorithms that can influence customer experience in real time and reduce fraud. Jaya joined Walmart when it acquired Inkiru, the company she cofounded and served as chief analytics officer, which focused on leveraging data to create business value, particularly through the use of predictive analytics. Prior to Inkiru, Jaya spent four years at eBay Inc., at both eBay and PayPal, developing infrastructure to facilitate real-time linking of customers across eBay properties to enhance customer experience. Jaya holds a master’s degree in business administration from Villanova University.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Making data work for 240 million customers Data Case Studies

As the world's largest retailer, Walmart relies on data to power the best shopping experience across the Web, mobile, and stores at scale. Every week, more than 240 million people visit a Walmart store or website around the world. Jaya Kolhatkar explains how Walmart is using that transactional data to make shopping more seamless and personalized.

Madhuri Kollu heads the corporate Data and Analytics group at Sabre Holdings, where her areas of focus include enterprise dashboards, data federation, big data, analytics, and reporting. Madhuri has extensive experience in implementing a wide variety of BI and analytics tools to help organizations drive their business decisions using data and analytics. Madhuri holds a master’s degree in computer science from Drexel University and an undergraduate degree in computer engineering from University of Madras, India.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Self-service data integration in an IT-managed environment Data Case Studies

Sabre operates under stringent service-level agreements with each of its customers. Madhuri Kollu explains how, in the event of an incident, Sabre consolidates legacy data with data derived from its new ServiceNow platform to get an accurate picture of the SLAs and provide business managers the information they need to understand the impact.

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Presentations

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker explain how to simplify Hadoop-based data-centric applications with the CRUD (create, read, update, and delete) and interactive analytic functionality of Apache Impala (incubating) and Apache Kudu (incubating).

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Performance tuning your SQL-on-Hadoop deployment may seem overwhelming at times, especially for BI workloads that need interactive response times with high concurrency. Marcel Kornacker and Mostafa Mokhtar simplify the process and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.
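
As a taste of the schema-design and statistics themes, here is a hedged sketch using the impyla client; the endpoint and table names are hypothetical, and this is an illustration rather than the presenters' checklist.

    # Two common Impala tuning steps: partitioned Parquet storage for
    # small scans, and table statistics for better query plans.
    from impala.dbapi import connect

    cursor = connect(host="impalad.example.com", port=21050).cursor()

    # Partition pruning plus columnar Parquet keeps scans small.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS sales (id BIGINT, amount DOUBLE)
        PARTITIONED BY (sale_date STRING)
        STORED AS PARQUET
    """)

    # Up-to-date statistics let the planner choose good join orders.
    cursor.execute("COMPUTE STATS sales")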

Keith Kraus is a senior engineer in applied solutions engineering at NVIDIA in the Washington, DC, area, where his focus is on building GPU-accelerated solutions for data engineering, analytics, and visualization. Prior to NVIDIA, Keith did extensive data engineering, systems engineering, and data visualization work in the cybersecurity domain, where he focused on building a GPU-accelerated big data solution for advanced threat detection and cyber-hunting capabilities. Previously, Keith was a member of a research team that built a tool designed to optimally place automated defibrillators in urban environments. Keith graduated from Stevens Institute of Technology with a BEng in computer engineering and an MEng in networked information systems.

Presentations

Streaming cybersecurity into Graph: Accelerating data into Datastax Graph and Blazegraph Session

Cybersecurity has become a data problem and thus needs the best-in-breed big data tools. Joshua Patterson, Michael Wendt, and Keith Kraus explain how Accenture Labs's Cybersecurity team is using Apache Kafka, Spark, and Flink to stream data into Blazegraph and Datastax Graph to accelerate cyber defense.

Raj Krishnamurthy designs and develops system stacks consisting of software and hardware elements for emerging and contemporary data analytics workloads. He has been a technical staff member in the Enterprise Systems division at IBM since 2006. His work has impacted several platforms, software products, and roadmaps at IBM, both on mainframes and Power Systems. Raj has filed more than 76 patents (60+ still pending) and has written a number of external peer-reviewed publications. Raj holds a PhD in computer science and an MS and BS in electrical engineering.

Presentations

Tuning Spark machine-learning workloads Session

Spark's efficiency and speed can help reduce the TCO of existing clusters. This is because Spark's performance advantages allow it to complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload that improves runtimes by a factor of 2.22.
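
For orientation, this is roughly what such an alternating least squares workload looks like in Spark's ML library; the ratings below are synthetic, and knobs such as rank and maxIter are the usual tuning levers.

    # A small ALS matrix-factorization workload on synthetic ratings.
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("als-sketch").getOrCreate()

    ratings = spark.createDataFrame(
        [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
        ["userId", "itemId", "rating"])

    als = ALS(rank=10, maxIter=5, regParam=0.1,
              userCol="userId", itemCol="itemId", ratingCol="rating")
    model = als.fit(ratings)
    model.itemFactors.show()  # the learned low-rank item factors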

Vineet Kumar works as the Kentucky Transportation Cabinet’s data architect. Vineet has 14 years of IT experience, has worked with products from many database vendors, including Oracle, Microsoft, and IBM, and has played many key roles, including developer, database designer, and data warehouse architect. In his free time, he writes technical blogs. Vineet is passionate about open source products. His goal is to provide business data in real time to business users for analytics.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Kentucky Transportation Cabinet: Monitoring road activities with a real-time snow and ice information management system using Spark and Hadoop Data Case Studies

Kentucky's Transportation Cabinet is integrating streaming data—crowdsourced from Waze, Twitter, weather reports, sensors, and snow truck status—to improve public safety, reduce congestion, and enhance operations. Vineet Kumar shares how the data is processed using GeoEvent Processor, ArcServer, SDE, and Hadoop.

Richard Langlois is the president of IT Architecture & Strategy, which provides training and consulting services in big data, analytics, BI, enterprise architecture, and data governance. Previously, Richard was the director of search and big data analytics and director of enterprise data management for Yellow Pages (Canada), where his team provided development of solutions, data architecture and governance, and metadata management for all operational and analytics needs of Yellow Pages. Prior to his roles at Yellow Pages, Richard was enterprise architect adviser at National Bank and Desjardins Group and global chief architect at Tata Communications and led the Canadian BI practice at Capgemini. He also worked directly or through consulting mandates at Air Canada, Bell Canada, CN, Canadian Tire, GM, Hydro-Quebec, Investors Group, Seer Technologies, Sikorsky Aircraft, Texas Instruments, Unisys, and multiple government agencies.

Presentations

Yellow Pages (Canada): Our journey to speed of thought interactive analytics on top of Hadoop Session

The self-service YP Analytics application allows advertisers to understand their digital presence and ROI. Richard Langlois explains how Yellow Pages applied this expertise to an internal use case that delivers real-time analytics with Tableau, using OLAP on Hadoop enabled by a stack that includes HDFS, Parquet, Hive, Impala, and AtScale for fast analytics and data exploration.

Josh Laurito is the head of data and analytics at Univision Digital and teaches data visualization at the City University of New York. Previously, Josh worked at Gawker Media and helped start Lumesis, an analytics and data visualization company. Despite having taken only six jobs in his adult life, he has somehow worked for 10 different companies. If you are interested in learning more, please visit joshlaurito.com.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Web analytics in the platform era: The Gawker Media experience Data Case Studies

In the last year, many publishers have begun moving more content off their own websites and onto Facebook Instant Articles, Accelerated Mobile Pages (AMP), and other new platforms promising improvements in exposure and performance. Joshua Laurito explores Gawker’s experience working with these new partners, sharing the advantages gained as well as the unexpected costs and complexities incurred.

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig. Julien is an architect at Dremio and was previously the tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

Office Hour with Julien Le Dem (Dremio) Office Hours

Ask Julien about the future of column-oriented data processing with Arrow and Parquet, columnar execution, and hardware trends like RDMA, SSDs, and nonvolatile memory.

The future of column-oriented data processing with Arrow and Parquet Session

In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
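
The shared representation is easy to see in miniature with the pyarrow library (a sketch for orientation, not the talk's material): an Arrow table round-trips through Parquet without any row-by-row conversion.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pydict({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    pq.write_table(table, "example.parquet")      # columnar on disk (Parquet)
    roundtrip = pq.read_table("example.parquet")  # columnar in memory (Arrow)
    assert roundtrip.equals(table)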

Xavier Léauté is a software engineer at Confluent as well as a founding Druid committer and PMC member. Prior to his current role, he headed the backend engineering team at Metamarkets.

Presentations

Streaming analytics at 300 billion events per day with Kafka, Samza, and Druid Session

Ever wondered what it takes to scale Kafka, Samza, and Druid to handle complex, heterogeneous analytics workloads at petabyte size? Xavier Léauté discusses his experience scaling Metamarkets's real-time processing to over 3 million events per second and shares the challenges encountered and lessons learned along the way.

Mike Lee Williams is director of research at Fast Forward Labs, an applied machine intelligence lab in New York City, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Fast Forward Labs’s clients understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Unlocking unstructured text data with summarization Session

Our ability to extract meaning from unstructured text data has not kept pace with our ability to produce and store it, but recent breakthroughs in recurrent neural networks are allowing us to make exciting progress in computer understanding of language. Building on these new ideas, Michael Williams explores three ways to summarize text and presents prototype products for each approach.
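
The neural approaches the talk covers go well beyond a few lines of code, but the simplest member of the extractive family can be sketched as a word-frequency baseline (a toy illustration for orientation only, not one of the presenter's methods):

    # Score each sentence by the frequency of its words in the whole text
    # and keep the top-scoring sentences in their original order.
    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"\w+", text.lower()))
        ranked = sorted(
            sentences,
            key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
            reverse=True)
        keep = set(ranked[:n_sentences])
        return " ".join(s for s in sentences if s in keep)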

Josh Lemaitre is a senior data scientist at Thomson Reuters Labs, which partners with startups and academia to develop cutting-edge technology solutions that integrate big data infrastructure, machine learning and predictive analytics, and domain expertise in the financial, legal, and risk sectors. Previously, Josh was the director of analytics at RichRelevance, a personalization technology provider servicing retail customers and brand advertisers, where he oversaw the business intelligence and machine-learning functions. Prior to RichRelevance, Josh served in various analytic roles in the marketing and merchandising departments of Overstock.com. He holds a bachelor of science in economics from Duke University and a master of engineering from MIT.

Presentations

Predicting patent litigation Session

How can the value of a patent be quantified? Josh Lemaitre explores how Thomson Reuters Labs approached this problem by applying machine learning to the patent corpus in an effort to predict those most likely to be enforced via litigation. Josh covers infrastructure, methods, challenges, and opportunities for future research.

Jill Lepore is the David Woods Kemper ’41 Professor of American History at Harvard University and a staff writer at the New Yorker. Her books include the New York Times best-seller The Secret History of Wonder Woman; Book of Ages, a finalist for the National Book Award; and, most recently, Joe Gould’s Teeth. Jill lives in Cambridge, Massachusetts.

Presentations

The trouble with polls Keynote

American politics is adrift in a sea of polls. This year, that sea is deeper than ever before—and darker. Data science is upending the public opinion industry. But to what end? In a brief, illustrated history of the field, Jill Lepore demonstrates how pollsters rose to prominence by claiming that measuring public opinion is good for democracy and asks, "But what if it’s bad?"

Guy Levy-Yurista is the head of product at Sisense. Guy specializes in commercializing advanced technologies in dynamic market environments, drawing on a unique mix of skills acquired over 25 years in startup, venture capital, and Fortune 500 settings. Previously, Guy served as the executive vice president for the Usher Mobile Identity program at MicroStrategy, where he productized and launched a powerful identity intelligence tool; was the CTO of AirPatrol; was part of the executive team that sold an MDM company to McAfee; and led multiple teams at AOL and several startups. An experienced entrepreneur and venture capitalist, Guy holds a PhD in physics from the Weizmann Institute of Science in Rehovot, Israel, and an MBA from the Wharton School of the University of Pennsylvania.

Presentations

Future-proofing BI: An unexpected journey to leverage in-chip analytics in the IoT and AI Session

Guy Levy-Yurista explains the unexpected consequences of making big data processing significantly more agile than ever before and the impact it's having on human insight consumption.

Haoyuan Li is founder and CEO of Alluxio (formerly Tachyon Nexus), a memory-speed virtual distributed storage system. Before founding the company, Haoyuan was working on his PhD at UC Berkeley’s AMPLab, where he cocreated Alluxio. He is also a founding committer of Apache Spark. Previously, he worked at Conviva and Google. Haoyuan holds an MS from Cornell University and a BS from Peking University.

Presentations

Alluxio (formerly Tachyon): The journey thus far and the road ahead Session

Haoyuan Li offers an overview of Alluxio (formerly Tachyon), a memory-speed virtual distributed storage system. In the past year, the Alluxio project experienced a tremendous improvement in performance and scalability and was extended with key new features. This year, the goal is to make Alluxio accessible to an even wider set of users through a focus on security, new language bindings, and APIs.

Li Li is a software engineer on Google’s Cloud team. Previously, Li worked at Cloudera on the RecordService and Apache Sentry projects. She is also a committer and PMC member of the Apache Sentry (TLP) project. Li holds a master’s degree in computer science from Vanderbilt University.

Presentations

Authorization in the cloud: Enforcing access control across compute engines Session

Li Li and Hao Hao detail the architecture of Apache Sentry + RecordService for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access control, and encourage attendees to adopt Apache Sentry and RecordService to protect sensitive data on the multitenant cloud across the Hadoop ecosystem.

Tianhui Michael Li is the founder and CEO of the Data Incubator. Michael has worked as a data scientist lead at Foursquare, a quant at D.E. Shaw and JPMorgan, and a rocket scientist at NASA. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup that lets him focus on what he really loves. He did his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall scholar.

Presentations

Practical machine learning Tutorial

Tianhui Li and Robert Schroll of the Data Incubator offer a foundation in building intelligent business applications using machine learning, walking you through all the steps to prototyping and production—data cleaning, feature engineering, model building and evaluation, and deployment—and diving into an application for anomaly detection and a personalized recommendation engine.
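
The flavor of those steps is easy to preview with scikit-learn (a minimal sketch on synthetic data, not the tutorial's actual material): feature engineering and model building composed into one pipeline and evaluated by cross-validation.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for a cleaned dataset.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Feature engineering and model building as a single pipeline.
    model = make_pipeline(StandardScaler(), LogisticRegression())

    # Evaluation via five-fold cross-validation.
    print(cross_val_score(model, X, y, cv=5).mean())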

As an enterprise architect, Douglas Liming leverages his 19 years of SAS experience to design cradle-to-grave enterprise analytic infrastructures. His solutions address everything from data extraction and manipulation through predictive analytics, with SAS at their core and Hadoop at the epicenter. In addition to sales support and enablement, Doug provides support for complex POCs, where deep knowledge of various Hadoop distributions and virtualized hardware helps to highlight the value of big data and SAS analytics. Active in the Executive Briefing Center, he is also a leader for various business analytics modernization assessments (BAMAs). His first 17 years at SAS were spent in R&D, working to integrate SAS with numerous data sources for data mining. Doug enjoys explaining how SAS and open source complement each other. Skilled at breaking down the hype, he excels at simplifying highly technical topics and conversations so that nontechnical audiences can be involved in the discussion. He holds a BS in computer science from the University of North Carolina at Wilmington, where he ran Division I cross country and track. As a Linux enthusiast, he currently runs Fedora 23 on his laptop and uses Fedora 21 as his primary desktop.

Presentations

How an open analytics ecosystem became a lifesaver Session

Ready to take a deeper look at how Hadoop and its ecosystem have a widespread impact on analytics? Douglas Liming explains where SAS fits into the open ecosystem, why you no longer have to choose between analytics languages like Python, R, or SAS, and how a single, unified open analytics architecture empowers you to literally have it all.

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

Apache Kudu: 1.0 and beyond Session

Apache Kudu was first announced as a public beta release at Strata + Hadoop World NYC 2015 and recently reached 1.0; this conference marks its one-year anniversary as a public open source project. Todd Lipcon offers a brief refresher on the goals and feature set of the Kudu storage engine and covers the development that has taken place over the last year.

Creating real-time, data-centric applications with Impala and Kudu Session

Todd Lipcon and Marcel Kornacker explain how to simplify Hadoop-based data-centric applications with the CRUD (create, read, update, and delete) and interactive analytic functionality of Apache Impala (incubating) and Apache Kudu.

Jun Liu is a senior performance engineer in Intel’s Software and Services group, where he works on big data performance modeling and simulation, especially for SQL-on-Hadoop systems. Before joining Intel, Jun was a postdoctoral researcher and senior member of the Database Performance and Migration group (DPMG) at Dublin City University; his primary research focus is data migration and database performance optimization. Jun also worked as a software engineer at Ericsson, where he participated in the development of projects in real-time complex event processing and big data analysis. Jun holds a PhD in computing from Dublin City University, an MSc in advanced software engineering from University College Dublin, and a BSc in computer science from the Dublin Institute of Technology.

Presentations

Planning your SQL-on-Hadoop cluster for a multiuser environment with heterogeneous and concurrent query workloads Session

Many challenges exist in designing an SQL-on-Hadoop cluster for production in a multiuser environment with heterogeneous and concurrent query workloads. Jun Liu and Zhaojuan Bian draw on their personal experience to address these challenges, explaining how to determine the right size of your cluster with different combinations of hardware and software resources using a simulation-based approach.

Veronica Liwak is a data analyst at Polaris, a leader in the global fight to eradicate modern slavery and restore freedom to survivors, where she is responsible for researching and analyzing different types of human trafficking. Veronica supports Polaris’ Strategic Initiative to End Trafficking in Illicit Massage Businesses and focuses on research and development.

Presentations

Citi, Standard Chartered Bank, and Polaris: The modern information pipeline that fuels investigations of money laundering, fraud, and human trafficking Session

Join data experts from Citi, Standard Chartered Bank, and Polaris for a panel discussion moderated by Shankar Ganapathy. Learn about the principles, technologies, and processes they have used to design a highly efficient information management pipeline architected around the Hadoop ecosystem.

Nir Lotan is a machine learning product manager and team manager in Intel’s Advanced Analytics department. Nir’s team develops machine learning- and deep learning-related tools, including a tool that enables easy creation of deep learning models. Prior to this role, Nir held several product, system, and software management positions within Intel’s Design Center organization and other leading companies. Nir has 15 years of experience in software and systems engineering, products, and management. He holds a BSc in computer engineering from the Technion Institute of Technology.

Presentations

Fast deep learning at your fingertips Session

Amitai Armon and Nir Lotan outline a new, free software tool that enables the creation of deep learning models quickly and easily. The tool is based on existing deep learning frameworks and incorporates extensive optimizations that provide high performance on standard CPUs.

Stuart Lynn is a map scientist at CartoDB, a company that specializes in beautiful online geospatial visualizations and location intelligence. Stuart initially studied mathematical physics at Edinburgh University before deciding astronomy was prettier and easier to explain in bars, earning a PhD in astrophysics. He previously worked at the Adler Planetarium as the technical lead of the Zooniverse, the largest collection of online citizen science projects. Stuart is passionate about getting everyone involved in doing real science: collecting, analyzing, and mapping data and making real discoveries about themselves and the world.

Presentations

Designing a location intelligence platform for everyone by integrating data, analysis, and cartography Session

Geospatial analysis can provide deep insights into many datasets. Unfortunately, the key tools to unlocking these insights—geospatial statistics, machine learning, and meaningful cartography—remain inaccessible to nontechnical audiences. Stuart Lynn and Andy Eschbacher explore the design challenges in making these tools accessible and integrated in an intuitive location intelligence platform.

Evan Macmillan is the cofounder and CEO of Gridspace. Previously, he cofounded Zappedy, a payments technology company that was backed by Eric Schmidt’s venture fund and acquired by Groupon. Evan’s patents include an HCI invention and a bank infrastructure capability. He holds an engineering bachelor’s degree in product design from Stanford.

Presentations

Five-senses data: Using your senses to improve data signal and value Session

Data should be something you can see, feel, hear, taste, and touch. Drawing on real-world examples, Cameron Turner, Brad Sarsfield, Hanna Kang-Brown, and Evan Macmillan cover the emerging field of sensory data visualization, including data sonification, and explain where it's headed in the future.

Roger Magoulas is the research director at O’Reilly Media and chair of the Strata + Hadoop World conferences. Roger and his team build the analysis infrastructure and provide analytic services and insights on technology-adoption trends to business decision makers at O’Reilly and beyond. He and his team find what excites key innovators and use those insights to gather and analyze faint signals from various sources to make sense of what others may adopt and why.

Presentations

Thursday keynotes Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Ask me anything: Hadoop application architectures AMA

Mark Grover, Jonathan Seidman, and Ted Malaska, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Hadoop application architectures: Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing Tutorial

Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska demonstrate how to architect a modern, real-time big data platform and explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics such as real-time ETL, change data capture, and machine learning.

Top five mistakes when writing Spark applications Session

Ted Malaska and Mark Grover cover the top five things that prevent Spark developers from getting the most out of their Spark clusters. When these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters and the same data, using just a different approach.

Presentations

ODPi: The foundation for cross-distribution interoperability Session

With so much variance across Hadoop distributions, ODPi was established to create standards for both Hadoop components and testing applications on those components. Join John Mertic and Berni Schiefer to learn how application developers and companies considering Hadoop can benefit from ODPi.

A scientist, best-selling author, and entrepreneur, Gary Marcus is currently professor of psychology and neural science at NYU and CEO and cofounder of the recently formed Geometric Intelligence, Inc. Gary’s efforts to update the Turing test have spurred a worldwide movement, and his research on language, computation, artificial intelligence, and cognitive development has been published widely in leading journals such as Science and Nature. He is also the author of four books, including The Algebraic Mind, Kluge: The Haphazard Evolution of the Human Mind, and the New York Times best-seller Guitar Zero, and contributes frequently to the New Yorker and the New York Times. Gary’s most recent book, The Future of the Brain: Essays By the World’s Leading Neuroscientists, features the 2014 Nobel Laureates May-Britt and Edvard Moser.

Presentations

From big data to human-level artificial intelligence Keynote

Gary Marcus explores the gap between what machines do well and what people do well and what needs to happen before machines can match the flexibility and power of human cognition.

Bruce Martin is a senior instructor at Cloudera, where he teaches courses on data science, Apache Spark, Apache Hadoop, and data analysis. Previously, Bruce was principal architect and director of advanced concepts at SunGard Higher Education, where he developed the software architecture for SunGard’s Course Signals Early Intervention System, which uses machine learning algorithms to predict the success of students enrolled in university courses. His other roles have included senior staff engineer at Sun Microsystems and researcher at Hewlett-Packard Laboratories. Bruce has written many papers on data management and distributed system technologies, frequently presents his work at academic and industrial conferences, and has authored patents on distributed object technologies. He holds a PhD and a master’s degree in computer science from the University of California, San Diego, and a bachelor’s degree in computer science from the University of California, Berkeley.

Presentations

Data science at scale: Using Spark and Hadoop Training

Learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Through in-class simulations and exercises, Bruce Martin walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.

Data science at scale: Using Spark and Hadoop (Day 2) Training day 2

Learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Through in-class simulations and exercises, the instructor walks attendees through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.

Terry McFadden is the principal enterprise information architect at Procter & Gamble. He has a long history of attacking wicked problems in the data area, holds several US patents, was an early text analytics practitioner, and has been driving P&G’s big data efforts. Terry holds an MBA from Xavier University.

Presentations

Big data is a household word: How Procter & Gamble uses on-cluster Hadoop BI to give visual insight to hundreds of business users for everyday use Session

Terry McFadden and Priyank Patel discuss Procter & Gamble’s three-year journey to enable production applications with on-cluster BI technology, exploring in detail the architectural challenges and choices made by the team along the way.

Patrick McFadin is one of the leading experts in Apache Cassandra and data-modeling techniques. As a consultant and the chief evangelist for Apache Cassandra at DataStax, Patrick has helped build some of the largest and most exciting deployments in production. Prior to DataStax, he was chief architect at Hobsons, an education services company. There, Patrick spoke often on web application design and performance.

Presentations

Conquer the time series data pipeline with SMACK Tutorial

We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, while users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day with powerful data pipelines built with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka.
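
For flavor, here is a minimal PySpark Streaming sketch of the Kafka ingest step such a pipeline might start from; the topic name, broker address, and record layout are hypothetical, and the Cassandra write is left out:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="smack-ingest-sketch")
    ssc = StreamingContext(sc, 10)  # 10-second microbatches

    # Subscribe to a hypothetical "sensor-events" topic via the direct Kafka API
    stream = KafkaUtils.createDirectStream(
        ssc, ["sensor-events"], {"metadata.broker.list": "kafka:9092"})

    # Records look like "device_id,timestamp,value"; count events per device per batch
    counts = (stream.map(lambda kv: kv[1].split(",")[0])
                    .map(lambda device: (device, 1))
                    .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # a production pipeline would write to Cassandra here instead

    ssc.start()
    ssc.awaitTermination()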

Richard (Rick) McFarland is the vice president of data services at the Hearst Corporation, one of the largest diversified communications companies in the world, where he led the way in establishing Hearst’s first big data platform, the central connection point for all data systems across the Hearst network. This two-petabyte platform streams data from all of Hearst’s vast resources at a rate of two terabytes per day. Rick’s data strategy has paved the way for tens of millions of dollars of revenue impact and initiatives that have been transformative for Hearst.

Prior to Hearst, Rick led analytics teams at Amazon in both the global marketing and Kindle departments; these teams were responsible for building and utilizing large-scale data resources focused on customer analytics and marketing effectiveness measurement. Previously, Rick worked in financial data services as a partner at Novantas (a boutique financial services consulting firm in New York), as an EVP at Bank of America, and at Washington Mutual Bank, where he ran the retail pricing practice. He holds a PhD in statistics from the University of Virginia, a master’s degree in engineering and operations research from Stanford University, and an undergraduate degree in mathematics from the University of Kansas. Rick holds pending patents based on his work in secure data analysis and preserving customer privacy and is a frequent invited speaker at universities and conferences, often on the subjects of data science and data collaboration. He is also on the advisory board for the Stanford Cookie Clearinghouse, which provides information to help users make choices about online privacy.

Presentations

Life of a click: How Hearst manages clickstream analytics in the cloud Session

Rick McFarland explains how the Hearst Corporation utilizes big data and analytics tools like Spark and Kinesis to stream click data in real time from its 300+ websites worldwide. This streaming process feeds an editorial tool called Buzzing@Hearst, which provides instant feedback to authors on what is trending across the Hearst network.

Jim McHugh is vice president and general manager at NVIDIA. He currently leads DGX-1, the world’s first AI supercomputer in a box. Jim focuses on building a vision of organizational success and executing strategies to deliver computing solutions that benefit from GPUs in the data center. With over 25 years of experience as a marketing and business executive with startup, mid-sized, and high-profile companies, Jim has a deep knowledge and understanding of business drivers, market/customer dynamics, technology-centered products, and accelerated solutions. Previously, Jim held leadership positions with Cisco Systems, Sun Microsystems, and Apple, among others.

Presentations

Changing the landscape with deep learning and accelerated analytics Session

Customers are looking to extend the benefits beyond big data with the power of the deep learning and accelerated analytics ecosystems. Jim McHugh explains how customers are leveraging deep learning and accelerated analytics to turn insights into AI-driven knowledge and covers the growing ecosystem of solutions and technologies that are delivering on this promise.

Jeffrey McMillan is currently chief analytics and data officer for Morgan Stanley, where he built out the industry’s first data-driven investment recommendation platform that delivered targeted investment ideas. He focuses on developing the next generation of wealth management practices that look to leverage sophisticated analytics and digital capabilities to help relationship managers deliver the highest-quality investment advice customized to the unique needs of each client. Previously, Jeff was managing director at Credit Suisse within its private banking and wealth management business.

Presentations

Driving change: Intelligent systems in wealth management FinData

Jeff McMillan explores how he has used intelligent systems and predictive modeling at Morgan Stanley to fundamentally change the client service model from merely “selling” things to one that focuses on finding the things that customers want or need to invest in.

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center on developing and implementing scalable algorithms for scientific applications. Xiangrui has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Previously, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. He holds a PhD from Stanford, where he worked on randomized algorithms for large-scale linear regression problems.

Presentations

Ask me anything: The state of Spark AMA

Join Xiangrui Meng and Ram Sriharsha to discuss the state of Spark.

Recent developments in SparkR for advanced analytics Session

Xiangrui Meng explores recent community efforts to extend SparkR for scalable advanced analytics—including summary statistics, single-pass approximate algorithms, and machine-learning algorithms ported from Spark MLlib—and shows how to integrate existing R packages with SparkR to accelerate existing R workflows.

Leo Meyerovich cofounded Graphistry, Inc. to scale visual graph analysis (think exploring security alerts) by connecting browsers to GPU clusters. Graphistry builds upon the founding team’s work at UC Berkeley on the first parallel web browser and on Superconductor, a declarative GPU-accelerated data visualization language. Some of Leo’s most-referenced work is in language-based security and automatic verification for web apps and access control policies. His broader programming language design research received awards for the first reactive web language (OOPSLA, NSF GRFP), automatic parallelization and parallelizing the web browser (PLDI, Qualcomm Innovation Fellow), and sociological foundations (OOPSLA, SIGPLAN).

Presentations

Investigating event graphs at scale: Going from theory to practice Session

Visual analysis is changing in the era of GPU clusters. Now that compute at scale is easier, the bottleneck is mapping data to visualizations and intelligently interacting with them. Using datasets uploaded to Graphistry, Leo Meyerovich provides a glimpse into the emerging workflows for graph and linked-event analysis and offers common tricks for success.

Ingo Mierswa is a veteran data scientist and the cofounder and CEO of RapidMiner, the company he developed in the Artificial Intelligence Division of TU Dortmund, Germany. Ingo is responsible for strategic innovation at RapidMiner, deals with all big-picture questions around its technologies, and serves on the board. Under his leadership, RapidMiner grew by up to 300% per year over its first seven years. In 2012, he spearheaded the go-international strategy with the opening of offices in the US, the UK, and Hungary. After two rounds of fundraising, the acquisition of Radoop, and supporting the positioning of RapidMiner with leading analyst firms like Gartner and Forrester, Ingo takes a lot of pride in bringing the world’s best team to RapidMiner. Ingo has also authored numerous award-winning publications about predictive analytics and big data.

Presentations

The flux capacitor of machine learning: Turn data garbage into 1.21 gigawatt-powered acceleration Session

The flux capacitor was the core component that made time travel possible in Back to the Future, processing garbage as a power source. Did you know that you can achieve the same effect in machine learning? Ingo Mierswa demonstrates how you can power through your analytics faster than ever before using the knowledge of 250K data scientists.

Daniel Mintz is the chief data evangelist at Looker. Previously, he was head of data and analytics at fast-growing media startup Upworthy and director of analytics at political powerhouse MoveOn.org. Throughout his career, Daniel has focused on how people interact with data in their everyday lives and how they can use it to get better at what they do. He believes that with the right tools and some basic training, anybody can learn to make data-informed decisions that lead to better results for themselves and their business. And he’s deeply committed to using data to make the world a more sustainable, more just place.

Presentations

Winning with data: How ThredUp, Twilio, and Warby Parker use data to build advantage Session

Daniel Mintz dives into case studies from three companies—ThredUp, Twilio, and Warby Parker—that use data to generate sustainable competitive advantages in their industries.

Mostafa Mokhtar is a performance engineer at Cloudera. Previously, he held similar roles at Hortonworks and on the SQL Server team at Microsoft.

Presentations

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop Session

Performance tuning your SQL-on-Hadoop deployment may seem overwhelming at times, especially for BI workloads that need interactive response times with high concurrency. Marcel Kornacker and Mostafa Mokhtar simplify the process and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Kim Montgomery is the head of analytics at GridCure, where she works on predictive modeling for the utility industry. Kim has a broad applied mathematics background with expertise in both predictive modeling and differential equations. Previously, as a postdoctoral scholar at the University of Utah and as a visiting professor at the Rose-Hulman Institute of Technology, she did mathematical biology research and taught applied mathematics. Her research has included using feedback control to stabilize solutions to differential equations, modeling hair cells in the inner ear, and studying signaling between retinal cells during development. She has completed more than 30 predictive modeling projects through Kaggle.com on topics such as predicting which used cars would be bad buys, predicting the jobs that would most interest a job seeker, and predicting the composition of soil from its spectral properties. She has been ranked 15th on Kaggle. She holds a PhD in applied mathematics from Northwestern University.

Presentations

Using the explosion of data in the utility industry to prevent explosions in utility infrastructure Session

With the advent of smart grid technology, the quantity of data collected by electrical utilities has increased by 3–5 orders of magnitude. To make full use of this data, utilities must expand their analytical capabilities and develop new analytical techniques. Kim Montgomery discusses some ways that big data tools are advancing the practice of preventative maintenance in the utility industry.

Jon Morra is the vice president of data science at ZEFR, where he leads a team of data scientists responsible for creating data-driven models. Jon and his team are focused on bringing ZEFR’s wealth of information about video on the internet to bear to better serve customers’ needs and meet market demands. Previously, Jon was the director of data science at eHarmony, where he helped grow the data science team to support multiple facets of the business.

Presentations

Data science at eHarmony: A generalized framework for personalization Session

Data science has always been a focus at eHarmony, but recently more business units have needed data-driven models. Jonathan Morra introduces Aloha, an open source project that allows the modeling group to quickly deploy type-safe, accurate models to production, and explores how eHarmony creates models with Apache Spark and how it uses them.

John Morrell is senior director of product marketing at Datameer, where he leads the go-to-market efforts for the Datameer product family and focuses on how customers use Datameer to solve their business problems. John has a 25-year history in enterprise software, bringing to market numerous enterprise software products and working extensively to help solve difficult business problems in data management, BI, and analytics for such companies as Aleri, Coral8, Active Software, webMethods, Oracle, Informix, and Fair Isaac. John holds an MBA from Bentley College and a BS in computer engineering from Syracuse University.

Presentations

Big data journeys from the real world Session

A panel of practitioners from Dell, National Instruments, and Citi—companies that are gaining real value from big data analytics—explores their companies’ big data journeys, explaining how analytics can answer groundbreaking new questions about business and create a path to becoming a data-driven organization.

Andreas Müller is a lecturer at the Data Science Institute at Columbia University and author of Introduction to Machine Learning with Python (O’Reilly), which describes a practical approach to machine learning with Python and scikit-learn. His mission is to create open tools that lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms. Andreas is one of the core developers of the scikit-learn machine learning library and has comaintained it for several years. He is also a Software Carpentry instructor. Previously, he worked at the NYU Center for Data Science on open source and open science and as a machine learning scientist at Amazon.

Presentations

Machine learning in Python Tutorial

Scikit-learn, which provides easy-to-use interfaces for performing advanced analysis and building powerful predictive models, has emerged as one of the most popular open source machine-learning toolkits. Using scikit-learn and Python as examples, Andreas Mueller offers an overview of basic concepts of machine learning, such as supervised and unsupervised learning, cross-validation, and model selection.
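
As a taste of the material, here is a minimal scikit-learn sketch of two of those concepts, cross-validation and model selection (the dataset and hyperparameter grid are chosen purely for illustration):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = load_iris(return_X_y=True)

    # Estimate generalization accuracy with 5-fold cross-validation
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print("CV accuracy: %.3f" % scores.mean())

    # Model selection: search a small hyperparameter grid, refitting on the best choice
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={"n_estimators": [10, 50, 100]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_)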

Office Hour with Andreas Mueller (NYU) Office Hours

Join Andreas to discuss machine learning, data science—in particular with Python—and using scikit-learn.

Kiran Muglurmath is the executive director of big data analytics at Comcast, where he manages a team of data scientists and big data engineers for machine learning, data mining, and predictive analytics. Prior to Comcast, Kiran was a consulting big data platform architect and data scientist at T-Mobile and Boeing. He holds an MBA from the Kellogg School at Northwestern University and a computer science degree from Bangalore University.

Presentations

Powering real-time analytics on Xfinity using Kudu Session

Sridhar Alla and Kiran Muglurmath explain how real-time analytics on Comcast Xfinity set-top boxes (STBs) help drive several customer-facing and internal data-science-oriented applications and how Comcast uses Kudu to fill the gaps in batch and real-time storage and computation needs, allowing Comcast to process high-speed data without the elaborate solutions needed until now.

Praveen Murugesan runs the Hadoop Platform team at Uber, which offers a data platform supporting large-scale data processing and interactive SQL as a service to the company. Previously, Praveen was part of the core infrastructure team at Salesforce that built out the messaging platform for enterprise customers.

Presentations

Big data processing with Hadoop and Spark, the Uber way Session

Praveen Murugesan explains how Uber leverages Hadoop and Spark as the cornerstones of its data infrastructure. Praveen details Uber’s current data architecture, outlines some of the unique data processing challenges Uber has faced, and describes its approach to solving key issues in order to continue to power Uber’s real-time marketplace.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).

Presentations

The future of column-oriented data processing with Arrow and Parquet Session

In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.

Raghunath Nambiar is the chief technology officer of Cisco’s Unified Computing System (UCS) business, where he helps define strategies for next-generation architectures, systems, and datacenter solutions and leads a team of engineers and product leaders focused on emerging technologies and solutions: big data, analytics, the internet of things, and artificial intelligence. He has played an instrumental role in accelerating the growth of Cisco UCS into a top datacenter compute platform. Raghu was previously a Cisco distinguished engineer and chief architect of big data and analytics solution engineering, responsible for incubating that practice and growing it into a mainstream portfolio. He brings years of technical accomplishments, with significant expertise in systems architecture, performance engineering, and creating disruptive technology solutions. Raghu has served in leadership positions on industry standards committees for performance evaluation and at leading academic conferences: he chaired the industry’s first standards committee for benchmarking big data systems and the industry’s first standards committee for benchmarking the internet of things and was the founding chair of the TPC’s International Conference Series on Performance Evaluation and Benchmarking. He has published more than 50 peer-reviewed papers and book chapters and 10 books in the Lecture Notes in Computer Science (LNCS) series and holds six patents, with several pending. Prior to Cisco, Raghu was an architect at Hewlett-Packard, responsible for several industry-first and disruptive technology solutions and a decade of performance benchmark leadership. He holds master’s degrees from the University of Massachusetts and Goa University and completed an advanced management program at Stanford University.

Raghu’s recent book, Transforming Industry Through Data Analytics, examines the role of analytics in enabling digital transformation, how the explosion in internet connections affects key industries, and how applied analytics will impact our future.

Presentations

Business insights driven by speed Keynote

The need to quickly acquire, process, prepare, store, and analyze data has never been greater. The need for performance crosses the big data ecosystem too—from the edge to the server to the analytics software, speed matters. Raghunath Nambiar shares a few use cases that have had significant organizational impact where performance was key.

Neha Narkhede is the cofounder and CTO at Confluent, a company backing the popular Apache Kafka messaging system. Previously, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s petabyte-scale streaming infrastructure built on top of Apache Kafka and Apache Samza. Neha specializes in building and scaling large distributed systems and is one of the initial authors of Apache Kafka. A distributed systems engineer by training, Neha works with data scientists, analysts, and business professionals to move the needle on results.

Presentations

Apache Kafka: The rise of real-time data and stream processing Session

Neha Narkhede explains how Apache Kafka serves as a foundation to streaming data applications that consume and process real-time data streams and introduces Kafka Connect, a system for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library. Neha also describes the lessons companies like LinkedIn learned building massive streaming data architectures.
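
Kafka Connect and Kafka Streams are Java APIs, but the consume-transform-produce pattern they formalize can be sketched in a few lines with the third-party kafka-python client (the topic names and transformation here are hypothetical):

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("page-views", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Continuously read raw events, derive an enriched record, and publish it downstream
    for message in consumer:
        enriched = message.value.upper()  # placeholder transformation on raw bytes
        producer.send("page-views-enriched", enriched)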

Office Hour with Neha Narkhede (Confluent) Office Hours

Join Neha for a deeper dive into Apache Kafka use cases, using Kafka for stream processing, comparing the lightweight library approach of Kafka Streams with heavier, framework-based tools such as Spark Streaming or Storm, and how Kafka Connect can be combined with other tools, such as stream processing frameworks, to create a complete streaming data integration solution.

Rimma Nehme is a technical assistant in the Data group at Microsoft. Previously, Rimma was a principal software engineer in the Microsoft Jim Gray Systems Lab, where she jump-started PolyBase, a technology bridging the worlds of relational data and big data. Over the past 10 years, she has worked in Microsoft Research and on Microsoft product teams for numerous database technologies and platforms, including SQL Server and SQL Data Warehouse, focusing on statistics, query optimization, and approximate query processing. Rimma holds a CS PhD from Purdue University and an MBA from the University of Chicago.

Presentations

5 cloud AI innovations Session

The amount of cutting-edge technology that Azure puts at your fingertips is incredible. Artificial intelligence is no exception. Azure enables sophisticated capabilities in artificial intelligence, machine learning, deep learning, cognitive services, and advanced analytics. Rimma Nehme explains why Azure is the next AI supercomputer and how this vision is being implemented in reality.

Mark Nelson is the global head of MI and data management within the Financial Crime Compliance (FCC) team at Standard Chartered Bank in London and the workstream lead within the Financial Crime Compliance Risk Mitigation Programme, focused on the design and delivery of improvements to management information frameworks and the supporting interfaces and processes. Mark is responsible for both the delivery of MI and data in BAU and the MI transformation program for the FCC; he serves as the interface between IT and the business to ensure delivery of effective data management solutions that meet the requirements of varied stakeholder communities within global organizations.

Presentations

Citi, Standard Chartered Bank, and Polaris: The modern information pipeline that fuels investigations of money laundering, fraud, and human trafficking Session

Join data experts from Citi, Standard Chartered Bank, and Polaris for a panel discussion moderated by Shankar Ganapathy. Learn about the principles, technologies, and processes they have used to design a highly efficient information management pipeline architected around the Hadoop ecosystem.

Tony Ng is a director of engineering at eBay, where he leads the User Behavior Analytics, Experimentation, and Marketing Platform products. Tony is involved in building eBay’s core platforms and services, including cloud, big data analytics, real-time streaming, web services, and messaging systems. Prior to eBay, Tony worked at Yahoo and Sun Microsystems.

Presentations

Pulsar: Real-time analytics at scale leveraging Kafka, Kylin, and Druid Session

Enterprises are increasingly demanding real-time analytics and insights. Tony Ng offers an overview of Pulsar, an open source real-time streaming system used at eBay. Tony explains how Pulsar integrates Kafka, Kylin, and Druid to provide flexibility and scalability in event and metrics consumption.

Jack Norris is the senior vice president of data and applications at MapR Technologies. In his 20 years in enterprise software marketing, Jack has demonstrated a wide range of successes, from defining new markets for small companies to increasing sales of new products for large public companies. His broad experience includes launching and establishing analytics, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain & Company. He earned an MBA from UCLA’s Anderson School of Management and a BA in economics with honors and distinction from Stanford University.

Presentations

A data-first approach to drive real-time applications Session

Leading companies that are getting the most out of their data are not focusing on queries and data lakes; they are actively integrating analytics into their operations. Jack Norris reviews three customer case studies in ad/media, financial services, and healthcare to show how a focus on real-time data streams can transform the development, deployment, and future agility of applications.

Decision 2016: What is your data platform? Keynote

During election season, we’re tasked with considering the next four years and comparing platforms across candidates. What’s good for the country is good for your data. Consider what the next four years will look like for your organization. How will you lower costs and deliver innovation? Jack Norris reviews the requirements for a winning data platform, such as speed, scale, and agility.

Owen O’Malley is a software architect on Hadoop at Hortonworks, a startup focused on Hadoop development. Prior to cofounding Hortonworks, Owen and the rest of the Hortonworks team worked at Yahoo developing Hadoop. He has been contributing patches to Hadoop since before it was separated from Nutch and was the original chair of the Hadoop PMC. Before working on Hadoop, he worked on Yahoo Search’s WebMap project, which built a graph of the known web and applied many heuristics to the entire graph to control search. Prior to Yahoo, Owen wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He holds a PhD in software engineering from the University of California, Irvine.

Presentations

File format benchmark: Avro, JSON, ORC, and Parquet Session

Picking the best data format depends on what kind of data you have and how you plan to use it. Owen O'Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.
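
As a point of reference, converting a dataset between these formats for your own measurements is roughly a line per format in Spark; the paths here are illustrative, the Avro writer assumes the external spark-avro package, and ORC support assumes a Hive-enabled Spark build:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("format-compare")
             .enableHiveSupport().getOrCreate())

    # Load a sample dataset, then persist it in row- and column-oriented formats
    df = spark.read.json("events.json")
    df.write.mode("overwrite").parquet("events_parquet")
    df.write.mode("overwrite").orc("events_orc")
    df.write.mode("overwrite").format("com.databricks.spark.avro").save("events_avro")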

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and applications solutions. A veteran of @WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred, Mauricio Vacas, and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Stephen O'Sullivan, and Julie Steele will field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for the CDO and that office's evolving role. Even if you don’t have a specific question, join in to hear what others are asking.

Amy O’Connor is a big data evangelist and telecommunications specialist at Cloudera, the leading big data vendor. She advises customers globally as they introduce big data solutions and adopt enterprise-wide big data delivery capabilities. Amy was recently named one of Information Management’s 10 Big Data Experts to Know. Prior to joining Cloudera, Amy built and ran Nokia’s big data team, developing and managing Nokia’s data assets and leading a team of data scientists to drive insights. Previously, Amy was vice president of services marketing and also led strategy for the software and storage business units of Sun Microsystems.

Presentations

Breeding data scientists: A four-year study Session

At Strata + Hadoop World 2012, Amy O'Connor and her daughter Danielle Dean shared how they learned and built data science skills at Nokia. This year, Amy and Danielle explore how the landscape in the world of data science has changed in the past four years and explain how to be successful deriving value from data today.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

The new dynamics of big data Keynote

Since its inception, big data solutions have best been known for their ability to master the complexity of the volume, variety, and velocity of data. But as we enter the era of data democratization, there’s a new set of concerns to consider. Mike Olson discusses the new dynamics of big data and how a renewed approach focused on where, who, and why can lead to cutting-edge solutions.

Lynn Overmann serves as senior policy advisor to the United States chief technology officer at the White House Office of Science and Technology Policy (OSTP). In this role, Lynn leads the White House Data-Driven Justice Initiative, focused on data-driven strategies to divert low-level offenders with mental illness out of the criminal system and change approaches to pre-trial incarceration so that low-risk offenders no longer stay in jail simply because they cannot afford a bond. She also oversees team CTO’s social and criminal justice reform efforts, including work on the White House Police Data Initiative, which focuses on using data to drive more effective community policing practices and opening police data to increase transparency and accountability.

Previously, Lynn was a presidential appointee at the US Department of Justice, helping to launch the Access to Justice Initiative, which focused on criminal and juvenile justice reform and improving legal services to the poor. She also spent a year deployed as a senior advisor to the mayor of New Orleans, helping to implement a range of reforms that reduced the city’s jail size, launch the city’s first-ever plan to end homelessness, and develop a project linking domestic violence legal service providers to local law firms for comprehensive pro bono representation. Before joining the Obama administration, Lynn was a civil rights and criminal defense attorney in Miami, Florida, serving five years as a public defender. She graduated from the NYU School of Law and holds a BA from Bryn Mawr College.

Presentations

Ask me anything: White House Office of Science and Technology Policy AMA

Join DJ Patil and Lynn Overmann to ask your questions about data science at the White House.

Data science: A view from the White House Keynote

Keynote by DJ Patil and Lynn Overmann

Jerry Overton is a data scientist and distinguished technologist in DXC’s Analytics group, where he is the principal data scientist for industrial machine learning, a strategic alliance between DXC and Microsoft comprising enterprise-scale applications across six different industries: banking and capital markets, energy and technology, insurance, manufacturing, healthcare, and retail. Jerry is the author of Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist (O’Reilly) and teaches the Safari training course Mastering Data Science at Enterprise Scale. In his blog, Doing Data Science, Jerry shares his experiences leading open research and transforming organizations using data science.

Presentations

Data science that works: Best practices for designing data-driven improvements, making them real, and driving change in your enterprise Tutorial

Join expert Jerry Overton as he explains how to make the business and technical aspects of your data strategy work together for best results.

How to build (and execute) a real data strategy Data 101

In chess, we might think of strategy as finding the patterns that put us in a better position to win. The same holds true for winning with data. Jerry Overton explains how to build and execute real data strategies, sharing basic methods for building strategic maps and launching data projects that will shorten the time it takes to gain insight into your most important business questions.

Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is a coauthor of Advanced Analytics with Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.

Presentations

Guerrilla guide to Python and Apache Hadoop Tutorial

Juliet Hougland and Sean Owen offer a practical overview of the basics of using Python data tools with a Hadoop cluster, covering HDFS connectivity and dealing with raw data files, running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating), and using Apache Spark to write more complex analytical jobs.
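
A minimal sketch of that workflow, assuming a Hive-enabled PySpark session on the cluster (the HDFS path and table name are made up):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("python-on-hadoop")
             .enableHiveSupport().getOrCreate())

    # Pull raw files straight out of HDFS into a DataFrame
    raw = spark.read.csv("hdfs:///data/raw/clicks.csv", header=True)
    raw.printSchema()

    # Run a SQL query against an existing Hive table
    top = spark.sql("SELECT user_id, COUNT(*) AS n FROM clicks "
                    "GROUP BY user_id ORDER BY n DESC LIMIT 10")
    top.show()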

Janaki Parameswaran has been delivering big data analytics solutions for 10+ years. She has expertise in ETL, data warehouses, cloud architectures, and business intelligence at scale.

Presentations

A unified ecosystem for market data visualization Session

FINRA ingests over 50 billion records of stock market trading data daily into multipetabyte databases. Janaki Parameswaran and Kishore Ramachandran explain how FINRA technology integrates data feeds from disparate systems to provide analytics and visuals for regulating equities, options, and fixed-income markets.

Robert Passarella evaluates AI and machine learning investment managers for Alpha Features. Rob has spent over 20 years on Wall Street in the gray zone between business and technology, focusing on leveraging technology and innovative information sources to empower novel ideas in research and the investment process. A veteran of Morgan Stanley, JPMorgan, Bear Stearns, Dow Jones, and Bloomberg, he has seen the transformational challenges firsthand. Always intrigued by the consumption and use of information for investment analysis, Rob is passionate about leveraging alternative and unstructured data for use with machine learning techniques. Rob holds an MBA from the Columbia Business School.

Presentations

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Machine learning, unstructured data, and the emerging investing landscape FinData

With the emergence of the internet, social media, and the IoT, the nature of analysis for investment decisions has shifted from linear analysis to nonlinear techniques. Robert Passarella offers a survey of how we arrived at this point in finance, where we came from, and where we're going as we leave the world of model-driven finance and enter the world of data-driven finance.

Priyank Patel is the cofounder and chief product officer at Arcadia Data, where he leads the team’s charter to build visually beautiful and highly scalable analytical products and works with customers through their successful adoption. Prior to cofounding Arcadia, Priyank was part of the founding engineering team at Aster Data, where he designed core components of the Aster Database. He holds a master’s degree in computer science from Stanford University.

Presentations

Big data is a household word: How Procter & Gamble uses on-cluster Hadoop BI to give visual insight to hundreds of business users for everyday use Session

Terry McFadden and Priyank Patel discuss Procter & Gamble’s three-year journey to enable production applications with on-cluster BI technology, exploring in detail the architectural challenges and choices made by the team along the way.

DJ Patil is the chief data scientist and deputy chief technology officer for data policy at the White House Office of Science and Technology Policy, where he advises on policies and practices to maintain US leadership in technology and innovation, fosters partnerships to maximize the nation’s return on its investment in data, and helps to attract and retain the best minds in data science to serve the public. Since joining OSTP, DJ has collaborated with colleagues across government, including the chief information officer and the US Digital Service as part of the Obama administration’s commitment to open data and data science. He leads data science efforts related to the Precision Medicine Initiative, which focuses on utilizing advances in data and health care to provide clinicians with new tools, knowledge, and therapies to select which treatments will work best for which patients while protecting patient privacy.

DJ joined the White House following an incredible career as a data scientist—a term he helped coin—in the public and private sectors and in academia. Most recently, he served as the vice president of product at RelateIQ (acquired by Salesforce) and previously held positions at LinkedIn, Greylock Partners, and eBay, where he oversaw initiatives at eBay, PayPal, and Skype. Prior to his work in the private sector, DJ was an American Association for the Advancement of Science (AAAS) science and technology policy fellow for the Department of Defense, where he directed new efforts to bridge computational and social sciences in fields like social network analysis to help anticipate emerging threats to the United States. DJ has authored a number of influential articles and books explaining the important current and potential applications of data science. In 2014, the World Economic Forum named DJ a Young Global Leader. He holds a bachelor’s degree in mathematics from the University of California, San Diego, and a PhD in applied mathematics from the University of Maryland, where he used open datasets published by the National Oceanic and Atmospheric Administration (NOAA) to make major improvements in numerical weather forecasting.

Presentations

Ask me anything: White House Office of Science and Technology Policy AMA

Join DJ Patil and Lynn Overmann to ask your questions about data science at the White House.

Data science: A view from the White House Keynote

Keynote by DJ Patil and Lynn Overmann

Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh is a cofounder of the DL4J open source deep learning project and is a coauthor of Deep Learning: A Practitioner’s Approach. Josh has over 15 years’ experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif. Josh holds a master’s degree in computer science from the University of Tennessee at Chattanooga, where he did research in mesh networks and social insect swarm algorithms.

Presentations

Conditional recurrent neural nets, generative AI Twitter bots, and DL4J Session

Can machines be creative? Josh Patterson and David Kale offer a practical demonstration—an interactive Twitter bot that users can ping to receive a response dynamically generated by a conditional recurrent neural net implemented using DL4J—that suggests the answer may be yes.

Joshua Patterson is the director of applied solutions engineering at NVIDIA. Previously, Josh worked with leading experts across the public and private sectors and academia to build a next-generation cyberdefense platform. He was also a White House Presidential Innovation Fellow. His current passions are graph analytics, machine learning, and GPU data acceleration. Josh also loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina’s Moore School of Business.

Presentations

Streaming cybersecurity into Graph: Accelerating data into Datastax Graph and Blazegraph Session

Cybersecurity has become a data problem and thus needs best-of-breed big data tools. Joshua Patterson, Michael Wendt, and Keith Kraus explain how Accenture Labs’ cybersecurity team is using Apache Kafka, Spark, and Flink to stream data into Blazegraph and Datastax Graph to accelerate cyber defense.

Maksim Pecherskiy is the chief data officer for the City of San Diego, where he strives to bring the necessary components together to allow the city’s residents to benefit from a more efficient, agile government that is as innovative as the community it serves. Previously, Maksim was a Code for America fellow in Puerto Rico focused on economic development; his team delivered a product called PrimerPeso, which gives business owners and residents a tool to search and apply for government programs for which they may be eligible. Maksim holds a bachelor of science degree in information systems from DePaul University and a bachelor of science degree in international business from Linköping University, Sweden.

Presentations

Beyond the numbers: Expanding the size of your analytic discovery team Session

Analytic discovery is a team sport; the lone hero data scientist is a thing of the past. John Akred of Silicon Valley Data Science leads a panel of analytics and data experts from Pfizer, the City of San Diego, and Neustar that explores how these businesses were changed through analytic collaboration.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, Tom was an early employee at VMware; as senior staff engineer, he was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular “pluggable storage architecture.” He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit file system.

Presentations

Lessons learned running Hadoop and Spark in Docker Session

Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale environments poses new challenges, especially for big data applications like Hadoop. Thomas Phelan shares lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.

Thomas Place is the director of data management at First Data Corporation, where he leads enterprise data initiatives. Tom currently focuses on turning the organization’s transactional data into a value-added asset through the development of an enterprise data lake and governance and quality controls, enabling new product development and the optimization of existing sales channels. Prior to First Data, Tom spent over a decade leading enterprise data technology initiatives in capital markets on both the buy and sell sides. Tom holds a BSc in computer science from the University of Nottingham in England.

Presentations

From lake to reservoir: Harnessing big data’s power for the enterprise Session

Thomas Place explores the big data journey of the world’s biggest payment processor, which came dangerously close to building a data swamp before pivoting to embrace governance and quality-first patterns. This case study includes patterns, partners, successes, failures, and lessons learned to date and reviews the journey ahead.

James Powell is Nielsen’s chief technology officer, leading the company’s data science, engineering, and technology teams. James joined Nielsen in this role in July 2015.

Immediately prior to joining Nielsen, James was executive vice president and chief technology officer at Thomson Reuters, where he was responsible for all technology, including shared services at scale, shared information platforms, and enterprise architecture and governance. In his seven years with Thomson Reuters, James served in a number of technology leadership roles, including chief technology officer of the Markets division and chief technology officer, Enterprise.

His career includes leadership positions in technology and data with Solace Systems, Citadel Investment Group, Reuters Group, and TIBCO Finance Technology. He began his career with Teknekron Software Systems.

James holds a bachelor’s degree in mathematics and a master’s degree in industrial robotics from Imperial College London.

Presentations

Hadoop in the cloud: A Nielsen use case Keynote

Cloudera CEO Tom Reilly and James Powell, global CTO of Nielsen, discuss the dynamics of Hadoop in the cloud, what to consider at the start of the journey, and how to implement a solution that delivers flexibility while meeting key enterprise requirements.

Tara Prakriya is chief product officer at Maana, where she spearheads product strategy and direction. Previously, Tara was senior vice president of product management at Scantron and spent 15 years at Microsoft in various roles, including partner general manager of technical strategy reporting to the CTO; product unit manager of the Tablet PC group in Windows; product unit manager of the advertising and content management teams in MSN; and the group program manager for iDSS, the first large-scale data warehouse for consumer web activity for the worldwide MSN and MSNBC networks. Prior to Microsoft, she worked in financial data warehousing at Merck. She holds multiple patents related to web advertising, data, digital ink, and other technologies. Tara holds an MBA in finance and a bachelor of business administration in computer science.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Driving field service profitability with advanced analytics Data Case Studies

GE Oil & Gas is at the forefront of leveraging the Industrial Internet and advanced analytics to drive profitability and growth. Jolene Jeffries and Tara Prakriya explain how subject-matter experts are using advanced analytics and machine learning to directly contribute to the profitability of the business unit.

Stephen Pratt is the CEO of Noodle.ai. Stephen is an instigator, an AI and machine-learning nerd, and a business process geek. Previously, he was the founder and CEO of Infosys Consulting and a senior partner at Deloitte.

Presentations

Corporate strategy: Artificial intelligence or bust Session

Stephen Pratt, the CEO of Noodle.ai and former head of Watson for IBM GBS, presents a shareholder value perspective on why enterprise artificial intelligence (eAI) will be the single largest competitive differentiator in business over the next five years—and what you can do to end up on top.

Sam Pullara is a managing director at Sutter Hill Ventures, where his current directorships include Boxer, FoundationDB, and Wavefront. He is also responsible for a number of the firm’s other investments, including Tomfoolery. Sam joined Sutter Hill Ventures from Twitter, where he was a senior infrastructure engineer and remains a technology advisor. He came to Twitter via the acquisition of Bagcheck in 2011, where he was cofounder and CEO. Previously, Sam was chief technologist at Yahoo, responsible for technology strategy across the audience organization and platform team, and chief architect at Borland following its acquisition of Gauntlet Systems, where he was also a cofounder and CEO. Sam came to Silicon Valley to work at WebLogic, where he was the first server engineer after the founders, and stayed on through the acquisition by BEA Systems. He holds an MS in physics from Northwestern University and a BS from Worcester Polytechnic Institute.

Presentations

Where's the puck headed? Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road. Join us to hear about the trends that everyone is seeing and areas for investment that they find exciting.

Uma Raghavan is the cofounder and CTO of Integris Software, an early-stage enterprise data privacy startup. Prior to Integris, Uma was director of data infrastructure at eBay, where she was in charge of Oracle, SAN, the Oracle data access software layer, the NoSQL stack, cloud database platforms built on OpenStack, messaging and streaming technologies, and all the BI/analytics platforms that powered the eBay website and its internal applications. She is a 22-year veteran of the technology industry with a successful track record of designing, building, and operating complex, large-scale web software systems at the intersection of big data, analytics, and machine learning with cloud and distributed computing technologies. Previously, Uma was director of software development for Amazon’s outbound fulfillment systems worldwide during Amazon’s explosive growth phase, led Amazon’s global catalog systems, helped launch Amazon in Italy, Spain, and Japan, and helped build Arizona (Amazon’s CRM technology), Amazon’s wireless store, and Amazon’s Auctions platform and 3P initiatives; she also worked at Microsoft, Walmart, and Daimler Chrysler. Uma holds a master’s degree in computer science from Wayne State University.

Presentations

Data risk intelligence in a regulated world Session

Uma Raghavan explains why you're about to see companies whose business models depend on using their customers' data, like Facebook, Google, and many others, scramble to keep up with the flood of new and evolving laws on data privacy.

Siva Raghupathy leads the Americas Big Data Solutions Architecture team at AWS, where he guides developers and architects in building successful big data solutions on AWS. Previously, as a principal technical program manager for AWS database services, Siva gathered emerging NoSQL requirements and wrote the first version of the DynamoDB product specification. Later, as a development manager for Amazon Relational Database Service (RDS), he drove several enhancements. Prior to AWS, Siva spent several years at Microsoft.

Presentations

Big data architectural patterns and best practices on AWS Session

Siva Raghupathy demonstrates how to combine Hadoop ecosystem innovations with Amazon Web Services cloud innovations, covering architectural patterns and best practices for building big data solutions on AWS.

Kishore Ramachandran is a technology director in FINRA’s Market Regulation department, where he leads the reporting and analytics team responsible for surveillance of the equities and options markets.

Presentations

A unified ecosystem for market data visualization Session

FINRA ingests over 50 billion records of stock market trading data daily into multipetabyte databases. Janaki Parameswaran and Kishore Ramachandran explain how FINRA technology integrates data feeds from disparate systems to provide analytics and visuals for regulating equities, options, and fixed-income markets.

Karthik Ramasamy is the engineering manager and technical lead for real-time analytics at Twitter. Karthik is the cocreator of Heron and has more than two decades of experience working in parallel databases, big data infrastructure, and networking. He cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Before Locomatix, he had a brief stint with Greenplum, where he worked on parallel query scheduling. Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases, and high-availability solutions for network routers that are widely deployed on the internet. He holds several patents and is the author of numerous publications and a best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik has a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively on parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were later spun out as a company that was acquired by Teradata.

Presentations

Twitter's real-time stack: Processing billions of events with Heron and DistributedLog Session

Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Karthik Ramasamy offers an overview of the end-to-end real-time stack Twitter designed in order to meet this challenge, consisting of DistributedLog (the distributed and replicated messaging system) and Heron (the streaming system for real-time computation).

Avinash Ramineni is a principal at Clairvoyant and leads the engineering efforts in the big data space. He is a passionate technologist with a drive to understand the bigger picture and convert vision into pragmatic, implementable solutions. Avinash has over 13 years of experience in engineering and architecting systems on a large scale. He specializes in providing solutions in the areas of big data, cloud, NoSQL, SOA, and event-driven architectures. Before Clairvoyant, Avinash was a principal engineer at Apollo Group, where he was responsible for innovation and technical guidance for all the product development efforts. Avinash holds an MS in computer science from Arizona State University.

Presentations

Choice Hotels's journey to better understand its customers through self-service analytics Session

Narasimhan Sampath and Avinash Ramineni share how Choice Hotels International used Spark Streaming, Kafka, Spark, and Spark SQL to create an advanced analytics platform that enables business users to be self-reliant by accessing the data they need from a variety of sources to generate customer insights and property dashboards and enable data-driven decisions with minimal IT engagement.

Luis Ramos is a senior staff engineer at GE Digital who recently transitioned from GE Global Research, where he drove initiatives on industrial big data projects during the early stages of Predix. Currently with the Predix Data Services team, Luis leads the Time Series Service development team. Prior to GE, he worked in startups, where he contributed to Hadoop ecosystem projects and built an analytics system used by major telecom companies, including Verizon, Sprint, and T-Mobile, for smartphone usage and MND. Luis holds a master’s degree in computer science from Cal State Fullerton.

Presentations

How GE analyzes billions of mission-critical events in real time using Apache Apex, Spark, and Kudu Session

Opportunities in the industrial world are expected to outpace consumer business cases. Time series data is growing exponentially as new machines get connected. Venkatesh Sivasubramanian and Luis Ramos explain how GE makes it faster and easier for systems to access (using a common layer) and perform analytics on a massive volume of time series data by combining Apache Apex, Spark, and Kudu.

Jun Rao is the cofounder of Confluent, a company that provides a streaming data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM’s Almaden Research Center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer on Apache Cassandra.

Presentations

Ask me anything: Apache Kafka AMA

Join Apache Kafka cocreator and PMC chair Jun Rao and Apache Kafka committer and architect of Kafka Connect Ewen Cheslack-Postava for a Q&A session about Apache Kafka. Bring your questions about Kafka internals or key considerations for developing your data pipeline and architecture, designing your applications, and running in production with Kafka.

Securing Apache Kafka Session

With Apache Kafka 0.9, the community has introduced a number of features to make data streams secure. Jun Rao explains the motivation for these changes, discusses the design of Kafka security, and demonstrates how to secure a Kafka cluster. Jun also covers common pitfalls in securing Kafka and talks about ongoing security work.
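
To make the topic concrete, here is a minimal, hedged sketch (illustrative only, not material from the session) of what connecting a client to a TLS-secured Kafka cluster can look like. It assumes the confluent-kafka Python client; all hostnames and certificate paths are placeholders, and the broker-side keystore and ACL setup the session covers is not shown.

from confluent_kafka import Producer

# Client-side TLS settings; all paths and hostnames are placeholders.
conf = {
    "bootstrap.servers": "broker1:9093",                  # TLS listener, not plaintext
    "security.protocol": "SSL",
    "ssl.ca.location": "/etc/kafka/ca-cert.pem",          # CA that signed the broker cert
    "ssl.certificate.location": "/etc/kafka/client.pem",  # client cert (for mutual TLS)
    "ssl.key.location": "/etc/kafka/client-key.pem",
}

producer = Producer(conf)
producer.produce("secured-topic", b"hello over TLS")
producer.flush()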

Tom Reilly is the CEO of Cloudera. Tom has had a distinguished 30-year career in the enterprise software market. Previously, Tom was vice president and general manager of enterprise security at HP; CEO of enterprise security company ArcSight, where he led the company through a successful initial public offering and subsequent sale to HP; and vice president of business information services for IBM, following the acquisition of Trigo Technologies Inc., a master data management (MDM) software company, where he served as CEO. He currently serves on the boards of Jive Software, privately held Ombud Inc., ThreatStream Inc., and Cloudera. Tom holds a BS in mechanical engineering from the University of California, Berkeley.

Presentations

Hadoop in the cloud: A Nielsen use case Keynote

Cloudera CEO Tom Reilly and James Powell, global CTO of Nielsen, discuss the dynamics of Hadoop in the cloud, what to consider at the start of the journey, and how to implement a solution that delivers flexibility while meeting key enterprise requirements.

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.

Presentations

BI and SQL analytics with Hadoop in the cloud Session

Henry Robinson and Justin Erickson explain how to best take advantage of the flexibility and cost-effectiveness of the cloud with your BI and SQL analytic workloads using Apache Hadoop and Apache Impala (incubating), covering the architectural considerations, best practices, tuning, and functionality available when deploying or migrating BI and SQL analytic workloads to the cloud.

Julie Rodriguez is vice president of product management and user experience at Eagle Investment Systems. An experience designer focusing on user research, analysis, and design for complex systems, Julie has patented her work in data visualizations for MATLAB and publishes industry articles on user experience and data analysis and visualization. She is the coauthor of Visualizing Financial Data, a book about visualization techniques and design principles that includes over 250 visuals depicting quantitative data.

Presentations

Encoding new data visualizations Data 101

Julie Rodriguez introduces new visualization methods that provide greater clarity to your data, demonstrating how to show associations and links between datasets to understand the impact one value has on another, how to communicate time-lapsed data to understand the context of an event, and how to display multiple variables to analyze and compare attributes.

Office Hour with Julie Rodriguez (Sapient Global Markets) Office Hours

Talk to Julie about new visualization methods.

Antonio Rosales is an engineering manager at Canonical. Antonio has spent the past 15 years in the Unix/Linux community, working with Sun Microsystems and IBM. He enjoys working on open source projects, specifically those that enable people to iterate on their ideas faster and help them realize their solutions.

Presentations

Open source operations: Building on Apache Spark with InsightEdge, TensorFlow, Apache Zeppelin, and your own project Session

Antonio Rosales offers an overview of Juju, an open source method to distill the best practices and operations needed to use interconnected big data solutions. By providing an open source means to describe services and solutions, users can focus on using the science, and developers can focus on delivering best practices.

Taposh Roy is a technical data science executive and advisor with a passion for turning data into actionable insights, meaningful stories, and awesome products. He brings a unique combination of experience across product, technology and strategy consulting, data science, and startups. Taposh is a consumer-focused machine learning and data science geek.

Presentations

Big data in healthcare Session

While other industries have embraced the digital era, healthcare is still playing catch-up. Kaiser Permanente has been a leader in healthcare technology and first started using computing to improve healthcare results in the 1960s. Taposh Roy, Rajiv Synghal, and Sabrina Dahlgren offer an overview of Kaiser’s big data strategy and explain how other organizations can adopt similar strategies.

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists, particularly focusing on the Apache Spark ecosystem. Previously, he was at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka. Neelesh holds a master’s degree in computer science with a focus on cloud computing from North Carolina State University and a bachelor’s degree in computer engineering from the University of Mumbai, India.

Presentations

Breaking Spark: The top five mistakes to avoid when using Apache Spark in production Session

Drawing on his experiences across 150+ production deployments, Neelesh Srinivas Salian focuses on five common issues observed in a cluster environment setup with Apache Spark (Core, Streaming, and SQL) to help you improve the usability and supportability of Apache Spark and avoid such issues in future deployments.
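
For a flavor of the pitfalls such sessions cover, here is one classic example as a hedged illustration (not necessarily one of the five issues Neelesh discusses): preferring reduceByKey over groupByKey for aggregation to cut shuffle volume.

from pyspark import SparkContext

sc = SparkContext(appName="spark-pitfall-sketch")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# groupByKey ships every value across the network before summing...
sums_slow = pairs.groupByKey().mapValues(sum)

# ...while reduceByKey combines values map-side first, shuffling far less data.
sums_fast = pairs.reduceByKey(lambda x, y: x + y)

print(sums_fast.collect())  # e.g., [('a', 4), ('b', 2)]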

Narasimhan Sampath is a systems architect at Choice Hotels International, one of the largest and most successful hotel chains in the world, where he works on enterprise big data and cloud architectures with a focus on performance tuning and scalability. Narasimhan also has rich experience with a variety of relational and NoSQL databases. He regularly presents at technology events, and his work on scalability has been recognized and published by Microsoft.

Presentations

Choice Hotels's journey to better understand its customers through self-service analytics Session

Narasimhan Sampath and Avinash Ramineni share how Choice Hotels International used Spark Streaming, Kafka, Spark, and Spark SQL to create an advanced analytics platform that enables business users to be self-reliant by accessing the data they need from a variety of sources to generate customer insights and property dashboards and enable data-driven decisions with minimal IT engagement.

Melissa Santos has over a decade of experience with all parts of the data pipeline, from ETLs to modeling. Her role as a data scientist at Big Cartel involves teaching both engineers and nontechnical people how to get the data they need. Melissa holds a PhD in applied math.

Presentations

Creating and evaluating a distance measure Session

Whether we're talking about spam emails, merging records, or investigating clusters, there are many times when having a measure of how alike things are makes them easier to work with (e.g., with unstructured data that isn't incorporated into your data models). Melissa Santos offers a practical approach to creating a distance metric and validating with business owners that it provides value.
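
As a hedged illustration of the approach (not code from the session), a record-matching distance might start as a weighted combination of per-field similarities; the field names and weights below are hypothetical.

from difflib import SequenceMatcher

def field_sim(a, b):
    """String similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical fields and weights; in practice these come from
# iterating with the business owners who know the records.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}

def record_distance(r1, r2):
    """Weighted distance in [0, 1]; 0 means identical on all fields."""
    sim = sum(w * field_sim(r1[f], r2[f]) for f, w in WEIGHTS.items())
    return 1.0 - sim

# Validate against pairs with known answers before rolling it out.
a = {"name": "Jon Smith", "email": "jon@example.com", "city": "Austin"}
b = {"name": "John Smith", "email": "jon@example.com", "city": "Austin"}
print(record_distance(a, b))  # small distance: likely the same person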

Anand Sanwal is the cofounder and CEO of CB Insights, a National Science Foundation-backed technology market intelligence platform that provides predictive intelligence into emerging technology trends, startups, and corporate strategy. Prior to founding CB Insights, Anand managed the $50 million Chairman’s Innovation Fund at American Express and worked in VC and corporate M&A. Earlier in his career, Anand worked at Kozmo.com, one of NYC’s most infamous dot-com flameouts, where he learned that if you buy something for $2 and sell it for $1, you will not make it up in volume. He holds degrees in chemical engineering from the University of Pennsylvania and in finance and accounting from the Wharton School.

Presentations

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

The future of fintech FinData

Anand Sanwal explores the trends, technologies, and business models that will disrupt financial services.

Brad Sarsfield is the principal software architect in data science for Microsoft HoloLens, where he focuses on making computing more personal by helping humans and machines understand each other through actionable data intelligence pipelines and distributed supervised machine-learning platforms. Prior to HoloLens, Brad was a founding member of Microsoft’s HDInsight Hadoop engineering team, where he helped port Apache Hadoop to Windows and put Hadoop on the Windows Azure cloud platform. Previously, Brad was a software engineering lead on Microsoft’s Cosmos exabyte-scale big data service. Brad spent several years on the SQL Server engine after joining Microsoft in 2002. He holds a bachelor’s degree in computer science from the University of Western Ontario.

Presentations

Five-senses data: Using your senses to improve data signal and value Session

Data should be something you can see, feel, hear, taste, and touch. Drawing on real-world examples, Cameron Turner, Brad Sarsfield, Hanna Kang-Brown, and Evan Macmillan cover the emerging field of sensory data visualization, including data sonification, and explain where it's headed in the future.

Holographic data visualizations: Welcome to the real world Session

Data visualizations using interactive holograms help us make smarter decisions and explore ideas faster by inspecting every vantage point of our data and interacting with it in new, more personal and human ways. There are new rules for the new world. Join Brad Sarsfield as he explores and experiments with the possibilities of the next generation of data visualization experiences.

Kaz Sato is a staff developer advocate on the Cloud Platform team at Google, where he leads the developer advocacy team for machine-learning and data analytics products such as TensorFlow, the Vision API, and BigQuery. Kaz has been leading and supporting developer communities for Google Cloud for over seven years, is a frequent speaker at conferences, including Google I/O 2016, Hadoop Summit 2016 San Jose, Strata + Hadoop World 2016, and Google Next 2015 NYC and Tel Aviv, and has hosted FPGA meetups since 2013.

Presentations

Machine intelligence at Google scale Session

The largest challenge for deep learning is scalability. Google has built a large-scale neural network in the cloud and is now sharing that power. Kazunori Sato introduces pretrained ML services, such as the Cloud Vision API and the Speech API, and explores how TensorFlow and Cloud Machine Learning can accelerate custom model training 10x–40x with Google's distributed training infrastructure.
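
As a minimal, hedged sketch of what calling one of these pretrained services looks like (not code from the session), the Cloud Vision API can be reached over REST; the API key and image file below are placeholders.

import base64
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; assumes the Vision API is enabled

with open("photo.jpg", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

body = {"requests": [{
    "image": {"content": content},
    "features": [{"type": "LABEL_DETECTION", "maxResults": 5}],
}]}

resp = requests.post(
    "https://vision.googleapis.com/v1/images:annotate?key=" + API_KEY,
    json=body,
)
print(resp.json())  # top-5 labels with confidence scores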

Amihai Savir is a seasoned data scientist who currently leads a team of data scientists at EMC. Amihai is also a lecturer at Ben-Gurion University, where he has taught a variety of subjects, including C programming, advanced Java programming, data structures, algorithms, and complexity. Prior to joining EMC, he held several research and development positions in Israeli high-tech companies and in academia, where he focused on various aspects of data science and software engineering. Amihai holds a master’s degree in computer science from Ben-Gurion University, where he specialized in recommender systems and machine learning.

Presentations

Data science from idea to pilot to production: Challenges and lessons learned Data 101

In the age of big data analytics, smart monitoring and predicting abnormal behavior of corporation mission-critical systems can save large amounts of time and money. Drawing on a real-world case study from EMC, Amihai Savir examines the winding path from idea to viable solution in a corporate environment and walks you through challenges encountered and lessons learned.

Andrei Savu is a software engineer at Cloudera, where he’s working on Cloudera Director, a product that makes Hadoop deployments in cloud environments easier and more reliable for customers.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Berni Schiefer is an IBM fellow in the IBM Analytics group at the IBM Spark Technology Center, where he is responsible for a global team that focuses on the performance and scalability of products and solutions in the Analytics group, specifically for big data technologies, including Spark, BigInsights, Big SQL, dashDB, DB2 pureScale, and DB2 with BLU acceleration. Berni’s passion is bringing advanced technology to market, with a particular emphasis on exploiting processor, memory, networking, and storage technology and other hardware and software acceleration technologies. Since joining IBM Canada in 1985, he has worked closely with many customers, ISVs, and business partners around the world. Berni holds a BSc in computer science from the University of Saskatchewan, from which he received the Alumni of Influence Award in 2016.

Presentations

ODPi: The foundation for cross-distribution interoperability Session

With so much variance across Hadoop distributions, ODPi was established to create standards for both Hadoop components and testing applications on those components. Join John Mertic and Berni Schiefer to learn how application developers and companies considering Hadoop can benefit from ODPi.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Practical machine learning Tutorial

Tianhui Li and Robert Schroll of the Data Incubator offer a foundation in building intelligent business applications using machine learning, walking you through all the steps to prototyping and production—data cleaning, feature engineering, model building and evaluation, and deployment—and diving into an application for anomaly detection and a personalized recommendation engine.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

Implementing extreme scaling and streaming in finance Session

Jim Scott outlines the core tenets of a message-driven architecture and explains its importance in real-time big data-enabled distributed systems within the realm of finance.

Yonik Seeley is the creator of Solr. He works at Cloudera, integrating “big search” technologies into the many components that comprise the Cloudera Enterprise Data Hub (EDH). Yonik was previously chief open source architect and cofounder at LucidWorks.

Presentations

Parallel SQL and analytics with Solr Session

Yonik Seeley explores recent Apache Solr features in the areas of faceting and analytics, including parallel SQL, streaming expressions, distributed join, and distributed graph queries, as well as the trade-offs of different approaches and strategies for maximizing scalability.
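
For context, here is a hedged sketch of what issuing a parallel SQL query to Solr looks like from Python (assumes a SolrCloud collection with the /sql handler enabled; the collection name and query are placeholders).

import requests

# Placeholders: a SolrCloud collection named "techproducts" on localhost.
resp = requests.post(
    "http://localhost:8983/solr/techproducts/sql",
    data={
        "stmt": "SELECT manu, count(*) FROM techproducts GROUP BY manu",
        "aggregationMode": "facet",  # push the aggregation into Solr's facet engine
    },
)
print(resp.text)  # tuples stream back as a JSON result-set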

Giannina Segnini is currently the director of the Data Concentration program at the Graduate School of Journalism at Columbia University. Previously, Giannina led a team of journalists and computer engineers fully dedicated to unfolding investigative stories by analyzing and visualizing public databases. Beginning in 1994, she led the investigative unit at La Nación, whose revelations have led to more than 50 criminal cases against politicians, businessmen, and officials. She also coaches cross-border investigations in Latin America and is part of the grand jury for the first Global Award on Data Journalism (Global Editors Network/Google).

Giannina has trained hundreds of journalists on investigative journalism and database journalism in Latin America, the US, Europe, and Asia and has served as trainer and consultant for several media and academic organizations such as the Journalism School at Columbia University in New York, News International in the UK, the Icelandic and Finnish Association of Investigative Journalism, O Globo and Folha de São Paulo in Brazil, El Tiempo in Colombia, El Nacional and Cadena Capriles in Venezuela, El Periódico and Siglo XXI in Guatemala, the Organization of American States (OAS) in Washington, DC, Freedom House, Inter American Press Association (IAPA), Universidad Nacional Autónoma de México (UNAM), the United Nations Development Programme (UNDP), Instituto de Prensa y Sociedad (IPYS), USAID, and Grupo de Diarios de América. She frequently speaks at high-level international conferences on investigative journalism, such as the Global Investigative Journalism Conference, the International Anti-Corruption Conference held by Transparency International, the International Press Institute (IPI), the News World Summit, and the Latin American Conference on Investigative Journalism.

Presentations

Connecting the dots through leaked and public data FinData

Offshore leaks, Lux leaks, Swiss leaks, Bahamas leaks, and the Panama Papers—all have one thing in common: they were all uncovered by the International Consortium of Investigative Journalists. Giannina Segnini and Mar Cabra explain how this global network of muckrakers uses technology to deal with big data and find cross-border stories that have worldwide impact.

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Jonathan Seidman is a software engineer on the partner engineering team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Ask me anything: Hadoop application architectures AMA

Mark Grover, Jonathan Seidman, and Ted Malaska, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Hadoop application architectures: Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing Tutorial

Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska demonstrate how to architect a modern, real-time big data platform and explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics such as real-time ETL, change data capture, and machine learning.

Viral Shah is a data solutions architect at Asurion Services, a global services provider that helps over 290M customers globally stay connected while driving loyalty to its partners’ brands. Viral’s responsibilities include providing data as a service to Asurion’s worldwide analytical users.

Presentations

Accelerating time to analytical value in the enterprise with data lake management Session

Viral Shah explains how enterprises like Asurion Services are leveraging big data management solutions to accelerate enterprise data lake initiatives for business value.

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions and expertise ranging from development to production deployment in a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. He has held technology leadership positions for NetApp, Fujitsu, and others. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes. He holds two patents.

Presentations

Building a modern data architecture Session

When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.

Cloud computing and big data Data 101

Ben Sharma uses popular cloud-based use cases to explore how to effectively and safely leverage big data in the cloud to achieve business goals. Now is the time to get the jump on this trend before your competition gets the upper hand.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera, working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop, among other technologies, and at KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Presentations

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX Tutorial

Vartika Singh and Jayant Shekhar walk you through techniques for building and tuning machine-learning apps using Spark MLlib and Spark ML Pipelines and graph processing with GraphX.
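
As a taste of the API (a hedged sketch with toy data, not the tutorial's material), a minimal Spark ML pipeline chains feature transformers and an estimator into a single fitted model.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

# Toy training data: (text, label).
train = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop map reduce", 0.0)],
    ["text", "label"])

# Chain a tokenizer, a term-frequency featurizer, and a classifier.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)
model.transform(train).select("text", "prediction").show()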

Rajesh Shroff is a big data solutions architect in Cisco’s Unified Computing Systems (UCS) and Data Center Solutions group. He has over 13 years of experience in the networking industry. His current focus is on developing big data solutions for data centers, specifically Cisco UCS servers. Rajesh holds a master’s degree in electrical engineering from the University of Southern California.

Presentations

Big data and analytics with Cisco UCS: Lessons learned and platform considerations Session

Rajesh Shroff reviews the big data and analytics landscape, lessons learned in the enterprise over the last few years, and some of the key considerations in designing a big data system.

Max Shron is the head of data science at Warby Parker. Previously, Max was data strategist and founder of Polynumeral, which provided mentorship, statistical analysis, and software development to organizations across a wide array of sizes and industry verticals looking to advance their data science practice, and the lead data scientist at OkCupid, where he did data work for the widely read OkTrends blog. His personal work has appeared worldwide, including in the New York Times, Chicago Tribune, and the Huffington Post and on WNYC.

Presentations

Ask me anything: Getting into (and out of) data science consulting AMA

Join Max Shron, former consultant on data science and current head of Warby Parker's data science team, for a Q&A all about data science consulting. Bring your questions about getting into the data science consulting business (or your questions about how to transition from consulting to something new). Even if you don't have questions, join in to hear what others are asking.

Tanvi Singh is the chief analytics officer, CCRO, at Credit Suisse, where she leads a global team of 60+ data scientists, data analysts, subject-matter experts, and investigators in Zurich, New York, London, and Singapore. The team is delivering multimillion-dollar big data projects with leading Silicon Valley vendors in the RegTech space. Tanvi has 18 years of experience in data science, business intelligence, digital analytics, data platforms, and change and transformation, with a focus on statistics, machine learning, text mining, and visualizations. Tanvi holds a master’s degree in software systems from the University of Zurich.

Presentations

Decision making beyond Excel: Using data science in banking compliance FinData

Private banking is a very traditional business and is in the midst of a major crisis regarding compliance, regulatory requirements, and investigations. Tanvi Singh highlights how the compliance function can be strengthened using data science to create a risk-based approach for assessing the health of the various divisions of the bank.

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Vartika Singh is a solutions architect at Cloudera with over 12 years of experience applying machine learning techniques to big data problems.

Presentations

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX Tutorial

Vartika Singh and Jayant Shekhar walk you through techniques for building and tuning machine-learning apps using Spark MLlib and Spark ML Pipelines and graph processing with GraphX.

Joseph Sirosh is the corporate vice president of the Cloud AI Platform at Microsoft, where he leads the company’s enterprise AI strategy and products such as Azure Machine Learning, Azure Cognitive Services, Azure Search, and Bot Framework. Prior to this role, he was the corporate vice president for Microsoft’s Data Platform. Joseph joined Microsoft from Amazon, where he was most recently the vice president for the Global Inventory Platform, responsible for the science and software behind Amazon’s supply chain and order fulfillment systems, as well as the central Machine Learning Group, which he built and led. Before joining Amazon, Joseph was vice president of research and development at Fair Isaac Corp., where he led R&D projects for DARPA, homeland security, and several government organizations. He is passionate about machine learning and its applications and has been active in the field since 1990. Joseph holds a PhD in computer science from the University of Texas at Austin and a BTech in computer science and engineering from the Indian Institute of Technology Chennai.

Presentations

Connected eyes Keynote

Will machine learning give us better eyesight? Join Joseph Sirosh for a surprising story about how machine learning, population data, and the cloud are coming together to fundamentally reimagine eye care in one of the world’s most populous countries, India.

Venkatesh Sivasubramanian is a senior director at GE Digital, where he drives the architecture and development of data services for Predix, GE’s industrial IoT platform. Prior to joining GE Digital, he was a lead engineer on the Big Fast Data team at WalmartLabs, building its stream processing engine and distributed systems. Venkatesh holds a master’s degree in software engineering from the Birla Institute of Technology and Science (BITS), India.

Presentations

How GE analyzes billions of mission-critical events in real time using Apache Apex, Spark, and Kudu Session

Opportunities in the industrial world are expected to outpace consumer business cases. Time series data is growing exponentially as new machines get connected. Venkatesh Sivasubramanian and Luis Ramos explain how GE makes it faster and easier for systems to access (using a common layer) and perform analytics on a massive volume of time series data by combining Apache Apex, Spark, and Kudu.

Crystal Skelton is an associate in Kelley Drye & Warren’s Los Angeles office, where she represents a wide array of clients from tech startups to established companies in privacy and data security, advertising and marketing, and consumer protection matters. Crystal advises clients on privacy, data security, and other consumer protection matters, specifically focusing on issues involving children’s privacy, mobile apps, data breach notification, and other emerging technologies and counsels clients on conducting practices in compliance with the FTC Act, the Children’s Online Privacy Protection Act (COPPA), the Gramm-Leach-Bliley Act, the GLB Safeguards Rule, Fair Credit Reporting Act (FCRA), the Fair and Accurate Credit Transactions Act (FACTA), and state privacy and information security laws. She regularly drafts privacy policies and terms of use for websites, mobile applications, and other connected devices.

Crystal also helps advertisers and manufacturers balance legal risks and business objectives to minimize the potential for regulator, competitor, or consumer challenge while still executing a successful campaign. Her advertising and marketing experience includes counseling clients on issues involved in environmental marketing, marketing to children, online behavioral advertising (OBA), commercial email messages, endorsements and testimonials, food marketing, and alcoholic beverage advertising. She represents clients in advertising substantiation proceedings and other matters before the Federal Trade Commission (FTC), the US Food and Drug Administration (FDA), and the Alcohol and Tobacco Tax and Trade Bureau (TTB) as well as in advertiser or competitor challenges before the National Advertising Division (NAD) of the Council of Better Business Bureaus. In addition, she assists clients in complying with accessibility standards and regulations implementing the Americans with Disabilities Act (ADA), including counseling companies on website accessibility and advertising and technical compliance issues for commercial and residential products. Prior to joining Kelley Drye, Crystal practiced privacy, advertising, and transactional law at a highly regarded firm in Washington, DC, and served as a law clerk at a well-respected complex commercial and environmental litigation firm in Los Angeles, CA. Earlier, she worked at the law firm featured in the movie Erin Brockovich, working directly with Erin Brockovich and the firm’s name partner to review potential new cases.

Presentations

Big data, big decisions: Key legal considerations for the collection and use of big data Session

Companies making data-driven decisions must consider critical legal obligations that may apply to the collection and use of data. Failing to do so has landed many tech stars and startups in hot legal water. Attorneys Kristi Wolff and Crystal Skelton discuss privacy, data security, and other legal considerations for using data across several industry types.

Darryl Smith is the chief data platform architect and distinguished engineer at Dell Technologies, where he is responsible for the design of Dell Technologies’ business data lake, utilizing many open source technologies.

Presentations

Getting it right exactly once: Principles for streaming architectures Session

Darryl Smith, chief data platform architect at Dell Technologies, outlines principles for building streaming architectures that process each event exactly once.

Mohit Soni is a distributed applications engineer at Mesosphere, where he works on the Datacenter Operating System (DC/OS). Previously, Mohit was an engineer on the platform team at eBay, where he focused on maximizing efficiency, increasing agility, and reducing cost. Mohit has presented at DockerCon 2014, Hadoop India Summit 2011, and BarCamp 2010. You can follow him as mohitsoni on GitHub and Twitter.

Presentations

Elastic data services on Mesos via Mesosphere’s DC/OS Session

Adam Bordelon and Mohit Soni demonstrate how projects like Apache Myriad (incubating) can install Hadoop on Mesosphere DC/OS alongside other data center-scale applications, enabling efficient resource sharing and isolation across a variety of distributed applications running on the same cluster and thus breaking down silos.

Ben Spivey is a principal solutions architect at Cloudera providing consulting services for large financial-services customers. Ben specializes in Hadoop security and operations. He is the coauthor of Hadoop Security from O’Reilly Media (2015).

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Many Hadoop clusters lack even basic security controls. Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You'll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Ram Sriharsha is the product manager for Apache Spark at Databricks and an Apache Spark committer and PMC member. Previously, Ram was architect of Spark and data science at Hortonworks and principal research scientist at Yahoo Labs, where he worked on scalable machine learning and data science. He holds a PhD in theoretical physics from the University of Maryland and a BTech in electronics from the Indian Institute of Technology, Madras.

Presentations

A deep dive into Structured Streaming in Spark Session

Structured Streaming is a new effort in Apache Spark to make stream processing simple without the need to learn a new programming paradigm or system. Ram Sriharsha offers an overview of Structured Streaming, discussing its support for event-time, out-of-order/delayed data, sessionization, and integration with the batch data stack to show how it simplifies building powerful continuous applications.
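
For orientation, here is a minimal, hedged sketch of the Structured Streaming API with event-time windowing (assumes a local netcat source, e.g. `nc -lk 9999`; not code from the session).

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Read lines from a local socket as an unbounded table with a timestamp column.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .option("includeTimestamp", True)
         .load())

# Count lines per 10-minute event-time window, sliding every 5 minutes;
# late or out-of-order records simply update the window they belong to.
counts = lines.groupBy(window(lines.timestamp, "10 minutes", "5 minutes")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()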

Ask me anything: The state of Spark AMA

Join Xiangrui Meng and Ram Sriharsha to discuss the state of Spark.

The state of Spark and what's next after Spark 2.0 Session

Ram Sriharsha reviews major developments in Apache Spark 2.0 and discusses future directions for the project to make Spark faster and easier to use for a wider array of workloads, with an emphasis on API evolution, single-node performance (Project Tungsten Phase 3), and Structured Streaming.

Jeremy Stanley is currently the VP of data science at Instacart, where he works closely with data scientists who are integrated into product teams to drive growth and profitability through logistics, catalog, search, consumer, shopper, and partner applications. Previously, Jeremy was chief data scientist and EVP of engineering at Sailthru, which builds data-driven solutions for marketers to drive long-term customer engagement and optimize revenue opportunities. As chief data scientist, he was responsible for the intelligence in the marketing personalization platform, which included prediction, recommendation, and optimization algorithms. As EVP of engineering, Jeremy led the development, operations, database, and engineering support teams and partnered with the CTO to drive innovation and stability while scaling.

Earlier in his career, Jeremy was the CTO of Collective, where he led a team of product managers, engineers, and data scientists in creating technology platforms that used machine learning and big data to address challenging multiscreen advertising problems, and he founded and led the Global Markets Analytics group at Ernst & Young (EY), which analyzed the firm’s markets, financial and personnel data to inform executive decision making. His background in data-driven technology products spans a decade consulting with numerous global financial services firms on predictive modeling applications as a leader in the Customer Analytics Advisory practice at EY.

Presentations

Making on-demand grocery delivery profitable with data science Session

Fifteen years ago, Webvan spectacularly failed to bring grocery delivery online. Speculation has been high that the current wave of on-demand grocery delivery startups will meet similar fates. Jeremy Stanley explains why this time the story will be different—data science is the key.

Julie Steele thinks in metaphors and finds beauty in the clear communication of ideas. She is particularly drawn to visual media as a way to understand and transmit information. Julie is coauthor of Beautiful Visualization (O’Reilly, 2010) and Designing Data Visualizations (O’Reilly, 2012).

Presentations

Ask me anything: Developing a modern enterprise data strategy AMA

John Akred, Stephen O'Sullivan, and Julie Steele will field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for the evolving CDO role. Even if you don’t have a specific question, join in to hear what others are asking.

Rupert Steffner is chief platform architect of Otto Group’s new business intelligence platform, BRAIN. In this role, Rupert is responsible for the entire setup as well as for initiating and managing the major change projects. Previously, he was head of the Marketing Department at the University of Applied Sciences, Salzburg, and worked as a business intelligence leader for several European and US companies in a range of industries, from ecommerce and retail to finance and telco. Rupert has over 25 years of experience in designing and implementing highly sophisticated technical and business solutions with a focus on customer-centric marketing. He holds an MBA from WU Vienna.

Presentations

AI-fueled customer experience: How online retailers are moving toward real-time perception, reasoning, and learning Session

Today’s online storefronts are good at generating transactions but poor at managing customers. Rupert Steffner explains why online retailers must build a complementary intelligence to perceive and reason on customer signals to better manage opportunities and risks along the customer journey. Individually managed customer experience is retailers' next challenge, and fueling it with AI is the right answer.

Nathan Stephens recently joined RStudio as director of solutions engineering. His background is in applied analytics and consulting. He has experience building data science teams, creating innovative data products, analyzing big data, and architecting analytic platforms. He was an early adopter of R and has introduced it into many organizations. Nathan holds an MS in statistics from Brigham Young University.

Presentations

R for big data Tutorial

Garrett Grolemund and Nathan Stephens explore the new sparklyr package by RStudio, which provides a familiar interface between the R language and Apache Spark and communicates with the Spark SQL and the Spark ML APIs so R users can easily manipulate and analyze data at scale.

Robert Stratton is senior group director at MarketShare: A Neustar Solution, where he leads the development of marketing analytics and decision support technology using a host of modeling methods on Hadoop. Robert has 15 years’ experience leading and conducting analytics projects across a wide range of industries. He holds a PhD in computer science from King’s College London and a master’s degree in development economics.

Presentations

Beyond the numbers: Expanding the size of your analytic discovery team Session

Analytic discovery is a team sport; the lone hero data scientist is a thing of the past. John Akred of Silicon Valley Data Science leads a panel of analytics and data experts from Pfizer, the City of San Diego, and Neustar that explores how these businesses were changed through analytic collaboration.

Mike Stringer is cofounder and managing partner of consulting and design firm Datascope Analytics, where he has led or contributed to projects across a variety of industries for clients including Procter & Gamble and Thomson Reuters. Mike is passionate about realizing the potential for data to be used as a resource to make a positive impact on business and society. He also enjoys decidedly non-data-oriented activities, including exploring the amazing food in Chicago, playing and listening to music, and generally making things from scratch. Mike holds a BS in engineering physics from the University of Colorado and a PhD in physics from Northwestern University.

Presentations

Data science and the Internet of Things: It's just the beginning Session

We're likely just at the beginning of data science. The people and things that are starting to be equipped with sensors will enable entirely new classes of problems that will have to be approached more scientifically. Mike Stringer outlines some of the issues that may arise for business, for data scientists, and for society.

Brian Suda is a master informatician currently residing in Reykjavík, Iceland. Since first logging on in the mid-’90s, he has spent a good portion of each day connected to the internet. When he is not hacking on microformats or writing about web technologies, he enjoys kite aerial photography. His own little patch of internet is Suda.co.uk, home to many of his past projects, publications, interviews, and crazy ideas.

Presentations

Introduction to visualizations using D3 Tutorial

Visualizations are a key part of conveying any dataset. D3 is the most popular, easiest, and most extensible way to put your data online interactively. Brian Suda outlines best practices for good data visualizations and explains how you can build them using D3.

Susan Sun is a data expert with DataKind, a freelance data scientist working with Google’s Education team, and an instructor at General Assembly.

Presentations

Adventures from the frontlines of data for good Session

JeanCarlo Bonilla, Susan Sun, and Caitlin Augustin explore how DataKind volunteer teams navigate the road to social impact by automating evidence collection for conservationists and helping expand the reach of mobile surveys so that more voices can be heard.

Rajiv Synghal is principal of big data strategy at Kaiser Permanente. Previously, he held delivery and architecture roles in Fortune 100 organizations, including Visa and Nokia, and startups, such as Kivera. An accomplished strategic thinker and adviser to senior management on issues around growth, profitability, competition, and innovation, Rajiv is equally adept at presenting value propositions to top management and doing deep dives with fellow engineers. Rajiv is the rare kind of technology professional who carries within him the pragmatism of business urgency and the will to find a way to solve a problem no matter what it takes. He has demonstrated an uncanny ability to learn and teach new concepts, easily adapt to change, and manage multiple concurrent tasks. Rajiv is currently advising a number of startups in the big data space that are developing technologies to provide strategic solutions to challenges in the healthcare field.

Presentations

Big data in healthcare Session

While other industries have embraced the digital era, healthcare is still playing catch-up. Kaiser Permanente has been a leader in healthcare technology and first started using computing to improve healthcare results in the 1960s. Taposh Roy, Rajiv Synghal, and Sabrina Dahlgren offer an overview of Kaiser’s big data strategy and explain how other organizations can adopt similar strategies.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Semantic natural language understanding with Spark Streaming, UIMA, and machine-learned ontologies Session

David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.
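To make the pipeline shape concrete, here is a minimal Python sketch of the Kafka-to-Spark-Streaming hop with a stand-in annotator in place of the UIMA components; the topic name, broker address, and annotator logic are illustrative assumptions, not the presenters’ code.

```python
# Minimal sketch: consume free-text records from Kafka in Spark Streaming
# microbatches and run a placeholder annotator over each record.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.3-2.4 Kafka support

def annotate(record):
    """Stand-in for a UIMA-style annotator: crudely tag 'entities'."""
    text = record[1]  # Kafka messages arrive as (key, value) pairs
    return {"text": text, "entities": [w for w in text.split() if w.isupper()]}

sc = SparkContext(appName="ClinicalNLPStream")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second microbatches

stream = KafkaUtils.createDirectStream(
    ssc, ["patient-records"], {"metadata.broker.list": "localhost:9092"})

# Annotate each record; the real system would feed downstream stages
# (taxonomy lookup, ontology inference) and write to Titan/Elasticsearch.
stream.map(annotate).pprint()

ssc.start()
ssc.awaitTermination()
```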

Jasjeet Thind is the vice president of data science and engineering at Zillow. His group focuses on machine-learned prediction models and big data systems that power use cases such as Zestimates, personalization, housing indices, search, content recommendations, and user segmentation. Prior to Zillow, Jasjeet served as director of engineering at Yahoo, where he architected a machine-learned real-time big data platform that leverages social signals for user-interest modeling and content prediction. The system powers personalized content on Yahoo, Yahoo Sports, and Yahoo News. Jasjeet holds a BS and an MS in computer science from Cornell University.

Presentations

Zillow: Transforming real estate through big data and data science Session

Zillow pioneered giving consumers unprecedented access to information about the housing market. Long gone are the days when you needed an agent to get comparables and prior sale and listing data. And with more data, data science has enabled more use cases. Jasjeet Thind explains how Zillow uses Spark and machine learning to transform real estate.

Rob Thomas is vice president of business development in IBM’s Information Management Software division, where he leads business development for information management software, including IBM’s enterprise data management and information integration products, and is responsible for mergers and acquisitions, channel strategy and sales, and major ISV and SI partnerships. Recently, he led IBM’s acquisition of Netezza. Rob brings extensive experience in management, business development, and consulting in the high technology and financial services industries and has worked extensively with global businesses.

Previously, Rob worked within IBM’s engineering services and semiconductor business in Asia Pacific, where he was responsible for the $1B operation, which included custom product design and development as well as manufacturing, and led a team with locations throughout Asia, including development centers in Japan, China, and India. In this capacity, Rob personally managed key engagements with Nintendo (microprocessor design for the Wii), Sony, Konica Minolta, Lenovo, CEC/Greatwall, Samsung, and other leading electronics companies. Rob started at IBM in Global Business Services (GBS), where, as a partner, he led the sales and execution of a variety of consulting engagements focused on strategy and change, operational improvement, and IT implementations to solve specific business issues. Prior to joining IBM, he was an equity research associate at Merrill Lynch and Wheat First Securities, where he developed expertise in valuing company operations and exploring strategic alternatives for companies in a variety of industries. Rob has published articles in a variety of publications, including InfoWorld and Silicon Valley Business Ink, on issues and trends in the IT industry, the high technology business, and strategy. Rob is an avid golfer and runner; he lives in New Canaan, CT, with his wife and three children.

Presentations

Bring data to life with Immersive Visualization Keynote

Data has long stopped being structured and flat, but the results of our analysis are still rendered as flat bar charts and scatter plots. We live in a 3D world, and we need to enable data interaction from all perspectives. Robert Thomas offers an overview of Immersive Visualization—integrated with notebooks and powered by Spark—which helps bring insights to life.

Zoltan Toth is a freelance data engineer and trainer with over 15 years of experience developing data-intensive applications. Zoltan spends most of his time helping companies kick off and mature their data analytics infrastructure and regularly delivers Hadoop, big data, and Spark training courses. Zoltan built Prezi’s big data infrastructure and later led Prezi’s data engineering team, scaling it to serve 60 million users on a data volume of over a petabyte. He also worked on big data and Spark-integration projects with RapidMiner, a global leader in predictive analytics. Besides working with data analytics architectures, Zoltan teaches at Central European University, one of the best independent universities in Europe.

Presentations

Spark camp: Exploring Wikipedia with Spark Tutorial

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Zoltan Toth explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms.

Steve Touw is the cofounder and CTO of Immuta. Steve has a long history of designing large-scale geotemporal analytics across the US intelligence community, including some of the very first Hadoop analytics as well as frameworks to manage complex multitenant data policy controls. He and his cofounders at Immuta drew on this real-world experience to build a software product to make data experimentation easier. Previously, Steve was the CTO of 42Six Solutions (acquired by Computer Sciences Corporation), where he led a large big data services engineering team. Steve holds a BS in geography from the University of Maryland.

Presentations

Sensitive data sharing for analytics Session

Sharing your valuable data internally or with third-party consumers can be risky due to data privacy regulations and IP considerations, but sharing can also generate revenue or help nonprofits succeed at world-changing missions. Steve Touw explores real-world examples of how a proper data architecture enables philanthropic missions and offers ideas for how to better share your data.

Matt Turck is a managing director of FirstMark Capital, where he invests across a broad range of early-stage enterprise and consumer startups, with a particular focus on big data, AI, and frontier tech companies. Previously, Matt was a managing director at Bloomberg Ventures, the investment and incubation arm of Bloomberg LP, which he helped start, and the cofounder of TripleHop Technologies, a venture-backed enterprise search software startup that was acquired by Oracle. Matt is passionate about building communities and organizes two large monthly events, Data Driven NYC (which focuses on data-driven startups, big data, and AI) and Hardwired NYC (which focuses on frontier tech, including the Internet of Things, AR/VR, drones, and other emerging technologies). Matt graduated from Sciences-Po (IEP) Paris and holds a master of laws (LLM) from Yale Law School. He blogs at Mattturck.com.

Presentations

Why is this disruption different from all other disruptions? Hadoop as a game changer in financial services Session

What's the point at which Hadoop tips from a Swiss-army knife of use cases to a new foundation that rearranges how the financial services marketplace turns data into profit and competitive advantage? This panel of expert practitioners looks into the near future to see if the inflection point is at hand.

Combining an extensive background in product research, data analysis, program management, and software development, Cameron Turner cofounded the Data Guild in 2013. Previously, he founded ClickStream Technologies, which was acquired by Microsoft. While at Microsoft, Cameron managed the Windows Telemetry team, responsible for all inbound data for all Microsoft products and partners. He is an active member of and speaker at a number of Bay Area tech groups, including Churchill Club, SOFTECH, the Young CEOs Club, the CIO Roundtable, and BayCHI. Cameron holds a BA in architecture from Dartmouth College, an MBA from Oxford University, and an MS in statistics from Stanford University.

Presentations

Five-senses data: Using your senses to improve data signal and value Session

Data should be something you can see, feel, hear, taste, and touch. Drawing on real-world examples, Cameron Turner, Brad Sarsfield, Hanna Kang-Brown, and Evan Macmillan cover the emerging field of sensory data visualization, including data sonification, and explain where it's headed in the future.

Mark Turner is the text analytics lead at Teradata Aster, where he specializes in developing text analytics applications in a wide range of industries, including financial services, manufacturing, oil and gas, retail, and cable media. Prior to joining Teradata Aster, Mark was manager of the Natural Language Processing (NLP) Lab at Thomson Corporation (now Thomson Reuters) and was director of an applied research and development group at CACI, a major federal contractor. He also contributed to the development of the Unified Medical Language System (UMLS), a major biomedical vocabulary resource, at NIH. Mark holds an AB in linguistics from the University of Chicago and an MS in information and computer science from Georgia Tech. He was also a visiting scientist at Carnegie Mellon University.

Presentations

What ties to what? Visualizing large-scale customer text data with bipartite graphs Session

Which suppliers are most likely to have delivery or quality issues? Does service, product placement, or price make the biggest difference in customer sentiment? Text data from sources like email and social media can give answers. Mark Turner explains how to see the associations between any two variables in text data by combining text analytics and the bipartite graph visualization technique.
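As a rough illustration of the graph side of that technique (with toy data, not Teradata Aster’s implementation), the Python sketch below builds a bipartite document-term graph with networkx and projects it onto terms to surface which variables tie to which.

```python
# Build a bipartite graph linking documents to extracted terms, then
# project it so that terms are connected when they share a document.
import networkx as nx
from networkx.algorithms import bipartite

# Toy associations mined from customer text: (document, extracted term).
edges = [("email_1", "late delivery"), ("email_1", "supplier A"),
         ("tweet_7", "supplier A"), ("tweet_7", "damaged"),
         ("email_3", "late delivery"), ("email_3", "supplier B")]

G = nx.Graph()
G.add_nodes_from({d for d, _ in edges}, bipartite=0)  # document nodes
G.add_nodes_from({t for _, t in edges}, bipartite=1)  # term nodes
G.add_edges_from(edges)

# Project onto the term side; edge weights count shared documents.
terms = {n for n, side in nx.get_node_attributes(G, "bipartite").items()
         if side == 1}
term_graph = bipartite.weighted_projected_graph(G, terms)
for u, v, w in term_graph.edges(data="weight"):
    print(u, "--", v, "co-occur in", w, "document(s)")
```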

Kostas Tzoumas is a PMC member of the Apache Flink project and cofounder of data Artisans, the company founded by the original development team that created Flink. Kostas has spoken extensively about Flink, including at Hadoop Summit San Jose 2015.

Presentations

Implementing streaming architecture with Apache Flink: Present and future Session

Apache Flink has seen incredible growth during the last year, both in development and usage, driven by the fundamental shift from batch to stream processing. Kostas Tzoumas demonstrates how Apache Flink enables real-time decisions, makes infrastructure less complex, and enables extremely efficient, accurate, and fault-tolerant streaming applications.

Mauricio Vacas is a data engineer at Silicon Valley Data Science, where he has developed solutions across the data platform, from ingestion to analysis and visualization. Previously, Mauricio spent more than five years as a technical architecture manager in Accenture’s R&D group and its big data practice, where he managed the development of a cloud-based data platform used by Accenture’s data science teams to create analytic models for multiple customer projects. Mauricio is passionate about technology and its ability to make a difference in people’s lives.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred, Mauricio Vacas, and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Crystal Valentine is the vice president of technology strategy at MapR Technologies. She has nearly two decades’ experience in big data research and practice. Previously, Crystal was a consultant at Ab Initio, where she worked with Fortune 500 companies to design and implement high-throughput, mission-critical applications and with equity investors as a technical expert on competing technologies and market trends. She was also a tenure-track professor in the Department of Computer Science at Amherst College. She is the author of several academic publications in the areas of algorithms, high-performance computing, and computational biology and holds a patent for extreme virtual memory. Crystal was a Fulbright Scholar in Italy and holds a PhD in computer science from Brown University as well as a bachelor’s degree from Amherst College.

Presentations

The keys to an event-based microservices application Session

Crystal Valentine draws on lessons learned from companies like Uber and Ericsson to outline the key principles to developing a microservices application. Along the way, Crystal describes how certain next-gen application areas—such as machine learning—are particularly well suited to implementation in a microservices architecture rather than a legacy application paradigm.

Using parallel graph-processing libraries for cancer genomics Session

Crystal Valentine explains how the large graph-processing frameworks that run on Hadoop can be used to detect significantly mutated protein signaling pathways in cancer genomes through a probabilistic analysis of large protein-protein interaction networks, using techniques similar to those used in social network analysis algorithms.

Bryan Van de Ven is a software engineer at Continuum Analytics. Previously, Bryan worked at the Applied Research Labs, developing software for sonar feature detection and classification systems on US Naval submarine platforms, and at Enthought, where he worked on problems in financial risk modeling and fluid mixing simulation. Bryan has also worked on an assortment of iOS projects as an independent consultant. Bryan is a core contributor to Bokeh and has contributed to the Chaco visualization library. Bryan holds undergraduate degrees in computer science and mathematics from UT Austin and a master’s degree in physics from UCLA.

Presentations

Interactive data applications in Python Tutorial

Bryan Van de Ven and Sarah Bird demonstrate how to build intelligent apps in a week with Bokeh, Python, and optimization.

Jen van der Meer is the founder and CEO of Reason Street, where she creates business models for social impact. A former Wall Street analyst and economist, Jen is a data doyen who masters the emerging edge of technological change. Throughout her career, Jen has practiced an approach that is equal parts data-driven and creative to understand and apply the opportunities for technology to transform the economy, society, and culture. Jen has held executive management roles at Organic, Frog Design, Dachis Group, and Luminary Labs. She was previously a partner at Drillteam, where she developed innovation crowdsourcing programs for companies such as Target, Toyota, Nestle, and Neiman Marcus, launching some of the industry’s first social media programs. Jen led the acquisition of Drillteam by Powered, Inc. in 2010, which was then merged into Dachis Group in 2011. She is actively engaged in the local startup community in New York City and is a vocal supporter of the open data movement. She is also an adjunct professor at NYU ITP and SVA’s Products of Design. Jen has a BA in comparative religion from Trinity College and an MBA from HEC in Paris. You can reach Jen at Jenvandermeer.org.

Presentations

Data case studies Tutorial

The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.

Bart van Leeuwen combines his 18 years of firefighting experience with 20 years of experience at Netage, a combination that gives him a new perspective on operational information delivery. With ever larger amounts of available data and changes in tactical approaches to firefighting, new and fresh thinking is needed. An “outside the box” thinker, Bart helps fire departments approach their information problems in a different way. In this process, technology is not the answer; it’s an enabler and should be treated as such. Currently, Bart is leading an innovation project that combines proven information management technology with new paradigms like semantic web technology to deal with information flows in a smarter and more agile way.

Presentations

Smart data for smarter firefighters Session

Smart data allows fire services to better protect the people they serve and keep their firefighters safe. The combination of open and nonpublic data used in a smart way generates new insights both in preparation and operations. Bart van Leeuwen discusses how the fire service is benefiting from open standards and best practices.

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, where she is responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Amit Vij is CEO of Kinetica, where he is responsible for the company vision and administrative and executive decisions. Amit is a subject-matter expert on geospatial intelligence and helped architect Kinetica, the fastest in-memory analytics platform for high-velocity data powered by graphics processing units (GPUs). Amit’s background is in computer engineering, and he has over a decade of software development experience in the commercial and federal space, with an emphasis on analyzing and visualizing big data. He is also a web entrepreneur, having successfully created and sold a leading consumer website that has been featured in several television news broadcasts and print publications. Amit holds a BS in computer engineering from the University of Maryland at College Park with a concentration in the fields of computer science, electrical engineering, and mathematics.

Presentations

Making real-time analytics on the data lake a reality Session

Data lakes provide large-scale data processing and storage at low cost but struggle to deliver real-time analytics without investment in large clusters. If you need subsecond analytic response on streaming data, consider a GPU database. Amit Vij and Mark Brooks outline the dramatic performance benefits a GPU database offers and explain how to integrate it with Hadoop.

Sriram Vishwanath is the president of Accordion Health and a professor in the Cockrell School of Engineering at the University of Texas at Austin, as well as a serial entrepreneur. He has roughly 20 years of experience in the domains of informatics, statistics, and data science and has authored over 150 refereed papers on the subjects. Sriram holds a PhD from Stanford University and an MS from Caltech.

Presentations

Transforming healthcare through precision data science: Myths and facts Keynote

Healthcare, a $3 trillion industry, is ripe for disruption through data science. However, there are many challenges in the journey to make healthcare a truly transparent, consumer-centric, data-driven industry. Sriram Vishwanath shares some myths and facts about data science's impact on healthcare.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a streaming data platform built on the Lightbend Reactive Platform, Kafka, Spark, Flink, and Mesosphere DC/OS. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects. He’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Just enough Scala for Spark Tutorial

Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.

Peter Wang is the cofounder and CTO of Anaconda, where he leads the product engineering team for the Anaconda platform and open source projects including Bokeh and Blaze. Peter has been developing commercial scientific computing and visualization software for over 15 years and has software design and development experience across a broad variety of areas, including 3D graphics, geophysics, financial risk modeling, large data simulation and visualization, and medical imaging. As a creator of the PyData conference, he also devotes time and energy to growing the Python data community by advocating, teaching, and speaking about Python at conferences worldwide. Peter holds a BA in physics from Cornell University.

Presentations

Successful open data science on Hadoop: From sandbox to production Session

Although Python and R promise powerful data science insights, they can also be complex to manage and deploy with Hadoop infrastructure. Peter Wang distills the vast array of Hadoop and data science tools and architectures down to the essentials that deliver a powerful and lightweight stack quickly so that you can accelerate time to value while meeting your data science, governance, and IT needs.

Sam Wang pioneered the science of data-based political prediction later popularized by Nate Silver. Like Nate, Sam has harnessed the power of big data to yield astonishingly accurate predictions of the last several American elections. In 2008 and 2012, Sam called 49 out of 50 states correctly; in 2004, he was 50 for 50. Sam’s project, the Princeton Election Consortium, analyzes polls across the US to create a single accurate snapshot of the nation at any moment. His methods have yielded highly accurate predictions of final results, high-resolution tracking of the race during the campaign, and key insights into where campaign money can be most valuably spent. The Princeton Election Consortium and its predecessor, the Meta-Analysis of State Polls, have attracted millions of hits and tens of thousands of visits per day during campaigns.

Sam’s statistical analysis in 2012 correctly predicted the presidential vote outcome in 49 of 50 states and even the two-candidate popular vote of 51.1% to 48.9%. That year, the Princeton Election Consortium also correctly called 10 out of 10 close Senate races and came within a few seats of the final House outcome. A highly recognized biophysicist and neuroscientist, Sam has also applied his statistical and probabilistic methods to complex experimental data with great success. He is the author of Welcome to Your Child’s Brain and Welcome to Your Brain, which have both been translated into over 20 languages worldwide. Sam was selected by the American Association for the Advancement of Science as a congressional science and engineering fellow. He also served on the staff of the US Senate Committee on Labor and Human Resources.

Presentations

Statistics, machine learning, and the crazy 2016 election Keynote

Although 2016 is a highly unusual political year, elections and public opinion follow predictable statistical properties. Sam Wang explains how the presidential, Senate, and House races can be tracked and forecast from freely available polling data using tools from statistics and machine learning.
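To see the general idea in miniature, here is a toy poll-aggregation sketch in Python. The state margins, the normal polling-error model, and the independence assumption are all illustrative simplifications, not Sam Wang’s actual methodology.

```python
# Toy election forecast: aggregate state poll margins into win
# probabilities and simulate the electoral vote distribution.
import random
from statistics import median, NormalDist  # NormalDist needs Python 3.8+

# Hypothetical median polled margins (candidate A minus B, in points)
# and electoral votes per state.
states = {"FL": (1.0, 29), "OH": (-2.0, 18), "PA": (3.5, 20), "NC": (-0.5, 15)}
POLL_ERROR_SD = 3.0  # assumed standard deviation of the polling error

def win_prob(margin):
    """P(true margin > 0) under a normal polling-error model."""
    return 1 - NormalDist(mu=margin, sigma=POLL_ERROR_SD).cdf(0)

def simulate(n=10_000):
    """Monte Carlo over states (naively treated as independent)."""
    return [sum(ev for margin, ev in states.values()
                if random.gauss(margin, POLL_ERROR_SD) > 0)
            for _ in range(n)]

sims = simulate()
print("median electoral votes for A:", median(sims))
print("P(A wins FL):", round(win_prob(states["FL"][0]), 2))
```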

Shuguang Wang is a senior data scientist at the Washington Post. He enjoys making sense of large amounts of data and is highly skilled with a variety of machine-learning and data-mining tools. At the Washington Post, Shuguang has successfully built several predictive models to tackle business challenges. Previously, he was a software development engineer at Amazon Web Services. Shuguang holds an MSc in computer science from the University of Pittsburgh.

Presentations

How the Washington Post uses machine learning to predict article popularity Session

Predicting which stories will become popular is an invaluable tool for newsrooms. Eui-Hong Han and Shuguang Wang explain how the Washington Post predicts what stories on its site will be popular with readers and share the challenges they faced in developing the tool and metrics on how they refined the tool to increase accuracy.

Sara M. Watson is a technology critic and writer in residence at Digital Asia Hub. She is also a research fellow at the Tow Center for Digital Journalism at Columbia University and an affiliate of the Berkman Klein Center for Internet and Society at Harvard University.

Her work explores how we are learning to live with, understand, and interpret our personal data and the algorithms that shape our experiences. She investigates the ways that corporations, governments, and individuals use data from wearable sensors, the internet of things, and other digitally processed systems. Sara immerses herself in emerging technologies to understand their personal impacts firsthand. Her writing has appeared in The Atlantic, Wired, Gizmodo, Columbia Journalism Review, Harvard Business Review, Al Jazeera America, and Slate.

Sara previously worked as an enterprise technology analyst at the Research Board (Gartner, Inc.), exploring the implications of technological trends for Fortune 500 CIOs. Sara holds an MSc in the Social Science of the Internet with distinction from the Oxford Internet Institute, where her award-winning thesis used ethnographic methods to examine the personal data interests of the Quantified Self community. She graduated from Harvard College magna cum laude with a joint degree in English and American Literature and Film Studies. Sara is currently based in Singapore and keeps close ties to Cambridge and New York. She tweets @smwat.

Presentations

The personalization spectrum Session

How are users meant to interpret the influence of big data and personalization in their targeted experiences? What signals do we have to show us how our data is used, how it improves or constrains our experience? Sara Watson explains that in order to develop normative opinions to shape policy and practice, users need means to guide their experience—the personalization spectrum.

Mike Wendt is a manager of applied solutions engineering at NVIDIA. His research work has focused on leveraging GPUs for big data analytics, data visualizations, and stream processing. Prior to joining NVIDIA, Mike led engineering work on big data technologies like Hadoop, DataStax Cassandra, Storm, and Spark. In addition, Mike has focused on developing new ways of visualizing data and the scalable architectures to support them. Mike holds a BS in computer engineering from the University of Maryland.

Presentations

Streaming cybersecurity into Graph: Accelerating data into DataStax Graph and Blazegraph Session

Cybersecurity has become a data problem and thus needs best-in-breed big data tools. Joshua Patterson, Michael Wendt, and Keith Kraus explain how Accenture Labs’ cybersecurity team is using Apache Kafka, Spark, and Flink to stream data into Blazegraph and DataStax Graph to accelerate cyber defense.

Jonathon Whitton is director of data services for PRGX, a global account services company that helps clients better manage and leverage their AP and supplier data. At PRGX, he is leading a big data initiative that has resulted in 10x faster processing of client data, lowering the cost of storage and increasing data availability for its business partners in the recovery audit, profit optimization, fraud prevention, healthcare, and oil and gas business lines. Jonathon has over 20 years of experience in technology, specializing in big data, Hadoop, process transformation, migration, and business analysis. Previously, he was licensed in NY, NJ, and CT to provide insurance-related advice as a financial planner; he was also a top-rated technical instructor with ExecuTrain and served in the 1/75 Ranger Regiment. Jonathon holds an MBA from Kennesaw State University and a bachelor’s degree from Duke University.

Presentations

Turning petabytes of data into millions in cost savings for the world’s biggest retailers Session

Jonathon Whitton details how PRGX is using Talend and Cloudera to load two million annual client flat files into a Hadoop cluster and perform recovery audit services in order to help clients detect, find, and fix leakage in their procurement and payment processes.

Martin Wicke is a software engineer working on making sure that TensorFlow is a thriving open source project. Before joining Google’s Brain team, Martin worked in a number of startups and did research on computer graphics at Berkeley and Stanford.

Presentations

Ask me anything: Deep learning with TensorFlow AMA

Martin Wicke and Josh Gordon field questions related to their tutorial, Deep Learning with TensorFlow.

Deep learning with TensorFlow Tutorial

Martin Wicke and Josh Gordon offer hands-on experience training and deploying a machine-learning system using TensorFlow, a popular open source library. You'll learn how to build machine-learning systems from simple classifiers to complex image-based models as well as how to deploy models in production using TensorFlow Serving.
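As a taste of the simplest kind of classifier such a tutorial starts from, here is a minimal softmax model written in the TensorFlow 1.x style of the period; it is a generic sketch on random stand-in data, not the presenters’ course materials.

```python
# Minimal softmax classifier in TensorFlow 1.x style: define a graph,
# then train it in a session with gradient descent.
import numpy as np
import tensorflow as tf

# Random stand-in data: 100 samples, 4 features, 3 classes.
X_train = np.random.rand(100, 4).astype(np.float32)
y_train = np.random.randint(0, 3, size=100)

x = tf.placeholder(tf.float32, [None, 4])
y = tf.placeholder(tf.int64, [None])

W = tf.Variable(tf.zeros([4, 3]))
b = tf.Variable(tf.zeros([3]))
logits = tf.matmul(x, W) + b

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
accuracy = tf.reduce_mean(
    tf.cast(tf.equal(tf.argmax(logits, 1), y), tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op, feed_dict={x: X_train, y: y_train})
    print("train accuracy:",
          sess.run(accuracy, feed_dict={x: X_train, y: y_train}))
```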

Removing complexity from scalable machine learning Session

Much of the success of deep learning in recent years can be attributed to scale—bigger datasets and more computing power—but scale can quickly become a problem. Distributed, asynchronous computing in heterogeneous environments is complex, hard to debug, and hard to profile and optimize. Martin Wicke demonstrates how to automate or abstract away such complexity, using TensorFlow as an example.

Cheryl Wiebe is the Analytics of Things practice lead for the Advanced Analytics Center of Expertise within Think Big, a Teradata company. The digitalization megatrend has been central to her work over the past 15 years. She has consulted with Global 1000 clients who are embedding and operationalizing analytics at scale into the industrial Internet, smart cities, connected factories, smart grids, and other emerging IoT landscapes. Cheryl collaborates with business and advanced analytics professionals to accelerate their capabilities and professionalize their advanced analytics practices. Her recent projects have focused on clients who are beginning their Internet of Things journey in two areas: aligning business strategy with analytics investments to get started competing on the analytics of things, and building a robust data infrastructure that makes sensor data and related data types available at scale so that analytics can be built into industrial applications.

Presentations

Big data meets the IoT Session

The IoT is fundamentally transforming industries and reconfiguring the technology landscape, but enterprises face challenges in effectively realizing the value from this next wave of information and opportunity. Cheryl Wiebe explores how leading companies harness the IoT by putting IoT data in context, fostering collaboration between IT and OT, and enabling a new breed of scalable analytics.

Edd Wilder-James is a strategist at Google, where he is helping build a strong and vital open source community around TensorFlow. A technology analyst, writer, and entrepreneur based in California, Edd previously helped transform businesses with data as vice president of strategy for Silicon Valley Data Science. Formerly Edd Dumbill, Edd was the founding program chair for the O’Reilly Strata conferences and chaired the Open Source Convention for six years. He was also the founding editor of the peer-reviewed journal Big Data. A startup veteran, Edd was the founder and creator of the Expectnation conference-management system and a cofounder of the Pharmalicensing.com online intellectual-property exchange. An advocate and contributor to open source software, Edd has contributed to various projects such as Debian and GNOME and created the DOAP vocabulary for describing software projects. Edd has written four books, including O’Reilly’s Learning Rails.

Presentations

Beyond the numbers: Expanding the size of your analytic discovery team Session

Analytic discovery is a team sport; the lone hero data scientist is a thing of the past. John Akred of Silicon Valley Data Science leads a panel of analytics and data experts from Pfizer, the City of San Diego, and Neustar that explores how these businesses were changed through analytic collaboration.

Developing a modern enterprise data strategy Tutorial

How do you reconcile the business opportunity of big data and data science with the sea of possible technologies? Fundamentally, data should serve the strategic imperatives of a business—those key aspirations that define an organization’s future vision. Edd Wilder-James and Colette Glaeser explain how to create a modern data strategy that powers data-driven business.

The business case for Spark, Kafka, and friends Data 101

Spark is white-hot at the moment, but why does it matter? Developers are usually the first to understand why some technologies cause more excitement than others. Edd Wilder-James relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2016 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Tim Williamson is a data scientist at Monsanto, where he leads a full stack data engineering team focused on creating distributed analysis capabilities around complex scientific datasets in genomics, genetics, and agronomic performance.

Presentations

Using graph databases to operationalize insights from big data Session

Tim Williamson and Emil Eifrem explain how organizations can use graph databases to operationalize insights from big data, drawing on the real-life example of Monsanto’s use of graph databases to conduct real-time graph analysis of the company’s data to transform the business in ways that were previously impossible.

One of the original founders of Radish Lab, Edward is a skilled interactive storyteller, digital technologist, and data scientist. Combining a rich understanding of creative design with standards-based development, Edward has spent more than a decade creating immersive and interactive digital experiences. He is currently Radish Lab’s technical director, managing the tech team and implementing innovative solutions to the complex challenges that social impact organizations face. He’s truly passionate about climate change and explores how the humanization and visualization of data can help it resonate and drive action across audiences.

Presentations

Shifting cities: A case study in data visualization Session

Radish Lab teamed up with science news nonprofit Climate Central to transform temperature data from 1,001 US cities into a compelling, simple interactive that received more than 1 million views within three days of launch. Alana Range and Brian Kahn offer an overview of the process of creating a viral, interactive data visualization with teams that regularly produce powerful data stories.

Kristi Wolff is special counsel in Kelley Drye’s Washington, DC, office. Kristi’s practice focuses on food, dietary supplements, medical devices, and emerging health/wearable technology and privacy issues. Kristi has extensive experience advising clients whose products fall within the overlapping jurisdictions of the Food and Drug Administration and the Federal Trade Commission. She handles matters across the full product lifecycle, including concept analysis, claim substantiation, label review, quality and recall scenarios, and contested matters involving the FTC, FDA, National Advertising Division, state attorneys general, and class-action litigation. In addition, she regularly counsels clients regarding network marketing compliance and calling/texting practices and works with several lifestyle brands in the apparel, jewelry, and fur industries.

Prior to joining Kelley Drye, Kristi was associate general counsel at Nestlé HealthCare Nutrition, Inc., a subsidiary of Nestlé SA, the world’s largest health, nutrition, and wellness company. She also served as senior counsel at Assurant Health, where she performed risk-management functions, including litigation and regulatory tasks. During law school, Kristi worked at the Wisconsin Department of Justice and the US Court of Appeals for the Ninth Circuit. Having served as in-house counsel in the healthcare and food products industries, Kristi is particularly attuned to balancing business objectives with legal considerations. Her skill in the consumer protection area was recently recognized when she was named a 2015 Washington, DC, Rising Star by Super Lawyers magazine.

Presentations

Big data, big decisions: Key legal considerations for the collection and use of big data Session

Companies making data-driven decisions must consider critical legal obligations that may apply to the collection and use of data. Failing to do so has landed many tech stars and startups in hot legal water. Attorneys Kristi Wolff and Crystal Skelton discuss privacy, data security, and other legal considerations for using data across several industry types.

Susan Woodward is a financial economist with specialties in consumer finance and venture capital. She has been chief economist at two federal agencies: the SEC (Securities and Exchange Commission) and HUD (Housing and Urban Development). Previously, she taught courses on investments and corporate finance and worked as an expert in consumer finance disputes, including for state attorneys general. Susan maintains a database of venture-funded companies and provides estimates of company value to VentureSource.

Presentations

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Measuring risk and staleness in venture reporting FinData

Susan Woodward discusses the venture capital funnel (as seen from the VentureSource database with additional research on failed companies from Sand Hill Econometrics) and analyzes the extent of stale pricing in venture values (and buyout fund values) in reports to investors. Did FAS 157, effective 2008, make any difference? Not much, as Susan explains. Reported values are still quite stale.

US venture: Risk, values, founder outcomes Keynote

Susan Woodward discusses venture outcomes—what fraction of companies make lots of money, what fraction just barely return capital, and what fraction fail completely. Susan uses updated figures on the fraction of entrepreneurs who succeed, including some interesting details on female founders of venture companies.

Ian Wrigley is the technology evangelist at StreamSets, the company behind the industry’s first data operations platform. Over his 25-year career, Ian has taught tens of thousands of students subjects ranging from C programming to Hadoop development and administration.

Presentations

An introduction to Apache Kafka Tutorial

Ian Wrigley demonstrates how to leverage the capabilities of Apache Kafka to collect, manage, and process stream data for both big data projects and general-purpose enterprise data integration. Ian covers system architecture, use cases, and how to write applications that publish data to, and subscribe to data from, Kafka—no prior knowledge of Kafka required.
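For a first taste of the publish/subscribe pattern the tutorial covers, here is a minimal sketch using the kafka-python client; the broker address and topic name are assumptions for illustration.

```python
# Publish a message to a Kafka topic, then read it back.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("purchases", key=b"user-42", value=b'{"item": "book", "qty": 1}')
producer.flush()  # block until the broker acknowledges the message

consumer = KafkaConsumer(
    "purchases",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=5000)      # stop iterating when no messages arrive

for message in consumer:
    print(message.key, message.value, "at offset", message.offset)
```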

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud services and data engineering. Previously, Jennifer worked as a product line manager at VMware, working on the vSphere and Photon system management platforms.

Presentations

Deploying and managing Hive, Spark, and Impala in the public cloud Tutorial

Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.

Fangjin Yang is a coauthor of the open source Druid project and a cofounder of Imply, a data analytics startup based in San Francisco. Previously, Fangjin held senior engineering positions at Metamarkets and Cisco Systems. Fangjin has a BASc in electrical engineering and an MASc in computer engineering from the University of Waterloo, Canada.

Presentations

An introduction to Druid Session

Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks suboptimal choices to power interactive applications. Fangjin Yang discusses using Druid for analytics and explains why the architecture is well suited to power analytic dashboards.
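To illustrate what powering an interactive dashboard looks like in practice, here is a hedged sketch that posts a native JSON timeseries query to a Druid broker from Python; the datasource name and broker address are hypothetical.

```python
# Query a Druid broker: hourly event counts over one day.
import json
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "page_events",        # hypothetical datasource
    "granularity": "hour",
    "intervals": ["2016-09-01/2016-09-02"],
    "aggregations": [
        {"type": "longSum", "name": "events", "fieldName": "count"}
    ],
}

resp = requests.post(
    "http://localhost:8082/druid/v2/",  # default broker port
    data=json.dumps(query),
    headers={"Content-Type": "application/json"})

for row in resp.json():  # one result row per hour bucket
    print(row["timestamp"], row["result"]["events"])
```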

Yuhao Yang is a software engineer at Intel, where he provides implementation, consulting, and tuning advice on the Hadoop ecosystem to industry partners. Yuhao’s area of focus is distributed machine learning, especially large-scale analytical applications and infrastructure on Spark. He’s also an active contributor to Spark MLlib (50+ patches), having delivered the implementations of online LDA, QR decomposition, and several Spark feature engineering transformers, as well as improvements to some important algorithms.

Presentations

Apache Spark in fintech: Building fraud detection applications with distributed machine learning at Intel Session

Through collaboration with some of the top payments companies around the world, Intel has developed an end-to-end solution for building fraud detection applications. Yuhao Yang explains how Intel used and extended Spark DataFrames and ML Pipelines to build the tool chain for financial fraud detection and shares the lessons learned during development.
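The building blocks here are Spark’s standard DataFrame and ML Pipeline APIs. The sketch below shows the shape of such a pipeline on made-up transaction data; it is a generic illustration, not Intel’s production tool chain.

```python
# Assemble features from transaction columns and fit a classifier
# inside a single reusable Pipeline.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("FraudDetectionSketch").getOrCreate()

# Toy labeled transactions: amount, merchant risk score, label (1 = fraud).
df = spark.createDataFrame(
    [(120.0, 0.1, 0), (9800.0, 0.9, 1), (35.5, 0.2, 0), (7200.0, 0.8, 1)],
    ["amount", "merchant_risk", "label"])

assembler = VectorAssembler(
    inputCols=["amount", "merchant_risk"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

model = Pipeline(stages=[assembler, rf]).fit(df)
model.transform(df).select("amount", "probability", "prediction").show()
```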

Chuck Yarbrough is the senior director of solutions marketing and management at Pentaho, a leading big data analytics company that helps organizations engineer big data connections, blend data, and report and visualize all of their data. Chuck is responsible for creating and driving Pentaho solutions that leverage the Pentaho platform, enabling customers to implement big data solutions quicker and achieve greater ROI faster. Chuck has more than 20 years of experience helping organizations use technology to their advantage to ensure they can run, manage, and transform their business through better use of data. A lifelong participant in the data game, Chuck has held leadership roles at Deloitte Consulting, SAP Business Objects, Hyperion, and National Semiconductor.

Presentations

Filling the data lake Session

It’s hard to get data into a data lake. Organizations hand-code their way through this, but with hundreds of data sources, it soon becomes unmanageable. Chuck Yarbrough offers a solution that uses metadata to autogenerate ingestion processes. Teams can drive hundreds of Hadoop onboarding processes through just a few templates, reducing development time and risk.
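The pattern is easy to see in miniature: instead of hand-coding one job per source, a single template is expanded over per-source metadata. The Python sketch below is purely illustrative, with hypothetical metadata fields and a hypothetical ingest command, not Pentaho’s implementation.

```python
# Drive many ingestion jobs from one template plus a metadata registry.
sources = [
    {"name": "orders",   "format": "csv",  "path": "/landing/orders"},
    {"name": "clicks",   "format": "json", "path": "/landing/clicks"},
    {"name": "invoices", "format": "csv",  "path": "/landing/invoices"},
]

TEMPLATE = ("ingest --source {path} --format {format} "
            "--target /lake/raw/{name}")

def generate_jobs(metadata):
    """Expand the single ingestion template across all registered sources."""
    return [TEMPLATE.format(**source) for source in metadata]

for job in generate_jobs(sources):
    print(job)  # in practice these would be submitted to a scheduler
```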

Andrew Yeung is head of product marketing at ClearStory Data, where he is responsible for driving go-to-market strategy and other outbound marketing activities. A technology-marketing leader with over 15 years of experience in data management, IT, and analytics, Andrew previously led product management and product marketing at CA Technologies, BEA Systems, and Oracle.

Presentations

Five ways to modernize your BI tools and make them work on more data Session

More data exists than ever before and in more disparate silos. Getting the insights you need, sifting through data, and answering new questions have all been complex, hairy tasks that only data jocks have been able to do. Andrew Yeung and Scott Anderson explore new ways to challenge the status quo and speed insights on diverse sources and demonstrate real customer use cases.

Martin Yip is a product line marketing manager for VMware’s Cloud Platform business unit, where he oversees product marketing for a portfolio of products including vSphere, vSphere with Operations Management, and Big Data. Martin has been in the high technology industry for over 10 years in various capacities, from engineering and consulting to marketing. He holds a bachelor’s degree in computer science from the University of California, Berkeley, and an MBA from the University of Southern California’s Marshall School of Business.

Presentations

Virtualizing big data: Effective approaches derived from real-world deployments Session

The trend of deploying Hadoop on virtual infrastructure is rapidly increasing. Martin Yip explores the benefits of virtualizing Hadoop through the lens of three real-world examples. You'll leave with the confidence to deploy your Hadoop clusters using virtualization.

Shui C. Yip is a senior manager of Pershing Business Analytics Services and the lead technology architect for search, Hadoop, and big data related technologies at BNY Mellon. For more than six years, Shui has been delivering world-class solutions that increase product competitiveness and promoting the use of search and big data technologies, and he has been a consistent leader through the evolution of these key technologies. Today, the enterprise Hadoop shared service is used by development users across BNY Mellon’s global business lines, and the enterprise search services are integrated into NetX360, Pershing’s suite of digital financial solutions for advisors, broker-dealers, family offices, fund managers, and registered investment advisor firms. As an innovator, Shui reviews and introduces the best of new technologies; in addition to managing multiple technology teams and serving as a senior technical advisor, he plans and organizes enterprise best practices. Shui holds a PhD in mathematics from the University of Maryland and a PhD in engineering from Columbia University. Shui is part of a select group recognized by the BNY Mellon Best of Class Individual Award for consistently delivering superior results that align with client priorities and organizational success.

Presentations

Why is this disruption different from all other disruptions? Hadoop as a game changer in financial services Session

What's the point at which Hadoop tips from a Swiss-army knife of use cases to a new foundation that rearranges how the financial services marketplace turns data into profit and competitive advantage? This panel of expert practitioners looks into the near future to see if the inflection point is at hand.

Mike Yoder is a software engineer at Cloudera who has worked on a variety of Hadoop security features and internal security initiatives. Most recently, he implemented log redaction and the encryption of sensitive configuration values in Cloudera Manager. Prior to Cloudera, he was a security architect at Vormetric.

Presentations

A practitioner’s guide to securing your Hadoop cluster Tutorial

Many Hadoop clusters lack even basic security controls. Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You'll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Kelley Yohe has been a key Agile delivery manager in building and enhancing data services and banking platforms for a number of startups and businesses that have successfully challenged norms in financial services. With data at the core of her focus, Kelley has assisted businesses in building, enhancing, governing, and innovating their data strategies with tools, processes, techniques, and platforms, including an array of open source and cloud-based technologies.

Presentations

FinData day Tutorial

Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.

Journeying toward a better customer experience FinData

Kelley Yohe explores utilizing analytics to map a customer’s journey and understand the customer experience gaps and potential business drivers lurking behind the scenes of your data, waiting to be tapped for growth. With this knowledge, businesses can target the right features, measure the results toward growth, and ensure a proper foundation for selecting the right data, tools, and platforms.

Fang Yu is the cofounder and CTO of DataVisor, where her work focuses on big data for security. Over the past 10 years, Fang has developed algorithms and built systems for identifying various kinds of malicious traffic, including worms, spam, bot queries, fake and hijacked account activities, and fraudulent financial transactions. Fang holds a PhD from the EECS Department at the University of California, Berkeley.

Presentations

Account takeovers are taking over: How big data can stop them Session

The value of online user accounts has led to a significant increase in account takeover (ATO) attacks. Cyber criminals create armies of compromised accounts to perform attacks including fraudulent transactions, bank withdrawals, reward program theft, and more. Fang Yu explains how the latest in big data technology is helping turn the tide on ATO campaigns.

Zhe Zhang is an engineering manager at LinkedIn, where he leads the Core Big Data Services team, which leverages open source technologies such as Hadoop, Spark, TensorFlow, and beyond to form the storage-compute engine of LinkedIn’s big data platform. Zhe is a PMC member of Apache Hadoop and author of HDFS erasure coding, a major feature for Hadoop 3.0. Previously, Zhe worked at Cloudera and IBM’s T. J. Watson Research Center. Zhe has over 20 research publications and 5 US patents. While at IBM, he received the Research Accomplishment Award and the Outstanding Technology Achievement Award.

Presentations

Debunking HDFS erasure coding performance myths Session

The new erasure coding feature in Apache Hadoop (HDFS-EC) reduces the storage cost by ~50% compared with 3x replication. Zhe Zhang and Uma Maheswara Rao G present the first-ever performance study of HDFS-EC and share insights on when and how to use the feature.
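The ~50% figure is straightforward storage arithmetic. With the Reed-Solomon (6,3) layout used by HDFS-EC (six data blocks plus three parity blocks per stripe):

$$
\mathrm{overhead}_{3\times} = \frac{3}{1} = 3.0,
\qquad
\mathrm{overhead}_{\mathrm{RS}(6,3)} = \frac{6+3}{6} = 1.5,
\qquad
\frac{1.5}{3.0} = 50\%.
$$

So erasure-coded data occupies half the raw storage of triple replication while still tolerating the loss of any three blocks in a stripe.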

Shivon Zilis is a venture capitalist and founding member of Bloomberg Beta, where she focuses on early-stage data and machine-intelligence investments. Shivon has led 12 investments since launch. One, Newsle, was acquired by LinkedIn; others include Context Relevant, Alation, and InfluxDB. She recently released a report on the current state of machine intelligence that analyzed thousands of companies and put forward predictions on where the industry is headed. Shivon’s previous experience includes building startups at Bloomberg Ventures, the firm’s incubator, and developing cloud core banking solutions for microfinance institutions at IBM. She is a C100 charter member and was named to Forbes magazine’s 30 under 30 list in venture capital.

Presentations

Where's the puck headed? Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road. Join us to hear about the trends that everyone is seeing and areas for investment that they find exciting.