Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Strata + Hadoop World 2016 Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.

Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads the internal technical infrastructure data processing teams for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging between the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Ask me anything: Stream processing Ask Me Anything

Apache Beam/Google Cloud Dataflow engineers Tyler Akidau, Kenneth Knowles, and Slava Chernyak will be on hand to answer a wide range of detailed questions about stream processing. Even if you don’t have a specific question, join in to hear what others are asking.

The evolution of massive-scale data processing Session

Tyler Akidau offers a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, comparing and contrasting systems at Google with popular open source systems in use today.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential to accelerate business, but how do you reconcile the opportunity with the sea of possible technologies? Conventional data strategy offers little to guide us, focusing more on governance than on creating new value. Scott Kurth and John Akred explain how to create a modern data strategy that powers data-driven business.

The business case for Spark, Kafka, and friends Data 101

Spark is white-hot, but why does it matter? Some technologies cause more excitement than others, and at first the only people who understand why are the developers who use them. John Akred offers a tour through the hottest emerging data technologies of 2016 and explains why they’re exciting, in the context of the new capabilities and economies they bring.

Frank Albers is a software engineer on the big data DevOps team at ING. He specializes in Hortonworks/Hadoop, infrastructure solution patterns for cloud services, architecture, migration to the (private) cloud, security, and networking.

Presentations

Lessons from integrating Hadoop into an enterprise security architecture Session

How do you connect a Hadoop cluster to an enterprise directory with 100,000+ users and centralized role and access management? Hellmar Becker and Frank Albers present ING's approach to aligning Hadoop authentication and role management with ING’s policies and architecture, discuss challenges they met on the way, and outline the solutions they found.

Alasdair Allan is a scientist, author, hacker, tinkerer, and journalist who has recently been spending a lot of time thinking about the Internet of Things, which he thinks is broken. He is the author of a number of books and sometimes also stands in front of cameras. You can often find him at conferences talking about interesting things or deploying sensors to measure them. A couple of years ago, he rolled out a mesh network of five hundred sensor motes covering the entirety of Moscone West during Google I/O. He’s still recovering. A few years before that, he caused a privacy scandal by uncovering that your iPhone was recording your location all the time, which caused several class-action lawsuits and a US Senate hearing. Some years on, he still isn’t sure what to think about that.

Alasdair sporadically writes blog posts about things that interest him or, more frequently, provides commentary in 140 characters or less. He is a contributing editor for Make magazine and a contributor to O’Reilly Radar. Alasdair is a former academic. As part of his work, he built a distributed peer-to-peer network of telescopes that, acting autonomously, reactively scheduled observations of time-critical events. Notable successes included contributing to the detection of what was—at the time—the most distant object yet discovered, a gamma-ray burster at a redshift of 8.2.

Presentations

Data privacy in the age of the Internet of Things Session

Privacy is no longer "a social norm," but this may not survive as the Internet of Things grows. Big data is all very well when it is harvested in the background. But it's a very different matter altogether when your things tattle on you behind your back. Alasdair Allan explains how the rush to connect devices to the Internet has led to sloppy privacy and security and why that can't continue.

Nick Amabile is the cofounder and principal of FullStack Analytics, a data and analytics consulting firm based in Brooklyn, NY, that builds data pipelines and delivers business value from data using the latest technologies and analytical techniques. Nick has helped startups and Fortune 500 companies alike gain insight from their data. Most recently, he held leadership positions at Jet.com during its hypergrowth and at Etsy during its IPO.

Presentations

Big SQL: The future of in-cluster analytics and enterprise adoption Session

Hear why big SQL is the future of analytics. Experts at Yahoo, Knewton, FullStack Analytics, and Looker discuss their respective data architectures, the trials and tribulations of running analytics in-cluster, and examples of the real business value gained from putting their data in the hands of employees across their companies.

Franz Aman is senior vice president of brand and demand at Informatica, where he is responsible for branding, global demand generation, marketing operations, content, and digital marketing. Previously, Franz held numerous executive positions within industry-leading technology companies, including SAP, BusinessObjects, BEA Systems, SGI, and Sun Microsystems. He has more than 20 years of experience in leadership and innovation across marketing, including global product marketing, product management, strategy, brand, and communications. Franz holds a degree in geophysics from Ludwig-Maximilians University, Munich, Germany.

Presentations

The business bottom line of data lakes: Real-life experiences Session

Data is no longer a by-product of business transactions; now, data is the business. Franz Aman explains how data lakes can put the power of big data into the hands of every business person, sharing the inside scoop on how he turned marketing into a new kind of revenue-generation machine and interviewing an Informatica customer about how data lakes have innovated and transformed their business.

Stephan Anné is a solution engineer at Hortonworks with rich experience in presales. Previously, Stephan worked as a solutions consultant on a global basis for NetApp and VMware, covering enterprise customers like Siemens and VW. Before that, he was with Sulzer GmbH, where he was project leader for a strategic development project in the automobile industry. Stephan has also run his own company developing Java-based applications for sports monitoring. He is an ultramarathon runner and a three-time Marathon des Sables finisher. He has two boys, 8 and 11 years old.

Presentations

The IoT with Apache NiFi and Hadoop: Better together Session

The Internet of Things and big data analytics are currently two of the hottest topics in IT. But how do you get started using them? Emil Andreas Siemes and Stephan Anné demonstrate how to use Apache NiFi to ingest, transform, and route sensor data into Hadoop and how to do further predictive analytics.

Assaf Araki is the senior architect for big data analytics at Intel, where his group is responsible for big data analytics path findings within the company. Assaf drives the overall work with the academy and industry for big data analytics and merges new technologies inside Intel Information Technology. He has over 10 years of experience in data warehousing, decision support solutions, and applied analytics within Intel.

Presentations

How to build a big data analytics competency center Session

Big data analytics brings value to enterprises, helping them achieve operational excellence. The big question is how you implement it. Drawing on firsthand experience, Assaf Araki and Itay Yogev share how Intel built a big data analytics competency center, exploring the key elements that help Intel grow its people and capabilities and the challenges and lessons learned.

Carme Artigas is the founder and CEO of Synergic Partners, a strategic and technological consulting firm specializing in big data and data science (acquired by Telefónica in November 2015). She has more than 20 years of extensive expertise in the telecommunications and IT fields and has held several executive roles in both private companies and governmental institutions. Carme is a member of the Innovation Board of CEOE and the Industry Affiliate Partners at Columbia University’s Data Science Institute. An in-demand speaker on big data, she has given talks at several international forums, including Strata + Hadoop World, and collaborates as a professor in various master’s programs on new technologies, big data, and innovation. Carme was recently recognized as the only Spanish woman among the 30 most influential women in business by Insight Success. She holds an MS in chemical engineering and an MBA from Ramon Llull University in Barcelona and an executive degree in venture capital from UC Berkeley’s Haas School of Business.

Presentations

Data-driven businesses: Disrupting business models with big data DDBD

The most important challenge companies face in realizing the value of big data is implementing a cultural change to become a data-driven organization. Carme Artigas shares real-world examples focusing on the business side of this technology disruption to show how big data is transforming different industries, including retail, insurance, telco, and digital businesses.

Johannes Bauer is currently lead data engineer at Cambridge Analytica. He has worked as a data engineer and data scientist at the SCL group and was a fellow of the ASI. Johannes holds a PhD in theoretical condensed matter physics and has postdoctoral experience working with big data and parallel processing at the Max Planck Institute (Germany) and Harvard University.

Presentations

A functional data integration pipeline using Scala HDS

Efficient, accurate, and robust ETL (extract, transform, load) pipelines are essential components for building successful data products. Johannes Bauer discusses the fundamental requirements for ETL pipelines, highlighting major guiding principles as well as challenges and outlining selected elements of ETL pipeline implementations using advanced elements of Scala.

Marie Beaugureau is the lead data editor for O’Reilly Media.

Presentations

Welcome remarks Data 101

O'Reilly Media lead data editor Marie Beaugureau welcomes you to Data 101.

Hellmar Becker is a solutions engineer at Hortonworks, where he is helping spread the word about what you can do with data in the modern world. Hellmar has worked in a number of positions in big data analytics and digital analytics. Previously, he worked at ING Bank implementing the Datalake Foundation project (based on Hadoop) within client information management.

Presentations

Lessons from integrating Hadoop into an enterprise security architecture Session

How do you connect a Hadoop cluster to an enterprise directory with 100,000+ users and centralized role and access management? Hellmar Becker and Frank Albers present ING's approach to aligning Hadoop authentication and role management with ING’s policies and architecture, discuss challenges they met on the way, and outline the solutions they found.

Thomas Beer is one of the lead architects at Continental’s eHorizon project, which massively applies big data technologies for processing and fusing various data sources such as vehicle sensor data, traffic data, and map data. Before joining Continental, Thomas was the lead architect for big data solutions at NTT Data. He holds a PhD in computer science.

Presentations

Year 2025: Big data as enabler of fully automated vehicles Session

Experience tells us a decision is only as good as the information it is based on. The same is true for driving. The better a vehicle knows its surroundings, the better it can support the driver. Information makes vehicles safer, more efficient, and more comfortable. Thomas Beer and Felix Werkmeister explain how Continental exploits big data technologies for building information-driven vehicles.

Francine Bennett is a data scientist and the CEO and cofounder of Mastodon C, a group of Agile big data specialists who offer open source Hadoop-powered technology and the technical and analytical skills to help organizations realize the potential of their data. Before founding Mastodon C, Francine spent a number of years working on big data analysis for search engines, helping them to turn lots of data into even more money. She enjoys good coffee, running, sleeping as much as possible, and exploring large datasets.

Presentations

The best university in the world Session

In 2014, Times Higher Education made the decision to move from being a traditional publisher to being a data business. As part of the move, it needed to bring the creation of the World University Rankings in-house and build a set of data products from scratch. Duncan Ross and Francine Bennett explain how the transition was made and highlight the challenges and lessons learned.

Using data for evil IV: The journey home Session

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis. Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences internationally and in the United States. He is the copresenter of various O’Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Apache Cassandra: Get trained and get better paid Training

O’Reilly Media and DataStax have partnered to create a 2-day developer course for Apache Cassandra. Get trained as a Cassandra developer at Strata + Hadoop World in London, be recognized for your NoSQL expertise, and benefit from the skyrocketing demand for Cassandra developers.

Apache Cassandra: Get trained and get better paid (Day 2) Training Day 2

O’Reilly Media and DataStax have partnered to create a 2-day developer course for Apache Cassandra. Get trained as a Cassandra developer at Strata + Hadoop World in London, be recognized for your NoSQL expertise, and benefit from the skyrocketing demand for Cassandra developers.

Danny Bickson is the cofounder of Dato, a Seattle-based predictive analytics company. Previously, Danny was a postdoc in the Machine Learning department at Carnegie Mellon University and one of the initiators of the GraphLab open source project.

Presentations

Recent trends in recommender systems HDS

A Netflix competition triggered a major academic research effort in recommender systems. However, there is still a big gap between academic research and industry. Danny Bickson covers the current state of recommender systems in industry and explains why, while user historical purchase data is understood very well, recommenders based on images and text are just starting to pick up.

Joerg Blumtritt is the founder and CEO of Datarella, a computational social science startup delivering mobile analytics, self-tracking solutions, and data science consulting. After graduating from university with a thesis on machine learning, Joerg worked as a researcher in behavioral sciences, focused on nonverbal communication. His projects have been funded by the European Commission, the German federal government, and the Max Planck Society. He subsequently ran marketing and research teams for TV networks ProSiebenSat.1 and RTL II and magazine publisher Hubert Burda Media. As European operations officer at Tremor Media, Joerg was in charge of building the New York-based video advertising network’s European enterprises. More recently, he was managing director of MediaCom Germany. Joerg is the founder and chairman of the German Social Media Association (AG Social Media) and the coauthor of the Slow Media Manifesto. Joerg blogs about big data and the future of social research at Beautifuldata.net and about the Quantified Self at Datarella.com.

Presentations

My AlgorithmicMe: The "Who is. . .?" of the future Session

Who does your computer think I am? Today, every person is digitally represented in a multitude of IT systems, based on invisible algorithms that pervasively control pieces of our lives through decisions made based on our preferences, interests, and even future actions. Joerg Blumtritt and Majken Sander explore these judgments, discuss their consequences, and present possible solutions.

Robert Bogucki is the chief science officer at deepsense.io, where he currently manages the R&D team and focuses on deep learning. Robert is also a successful Kaggle competitor. He started his professional career as a software engineer at UBS. Robert’s motivation to work in the IT industry is to take theoretical ideas and concepts and put them to good use. When tackling real-life problems, he particularly enjoys leveraging algorithms and computational power instead of, or in addition to, domain knowledge.

Presentations

Which whale is it anyway? Face recognition for right whales using deep learning Session

With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. To interest the data science community, NOAA Fisheries organized a competition hosted on Kaggle.com. Robert Bogucki and Maciej Klimek outline the winning solution.

Farrah Bostic created the Difference Engine based on her belief that deep understanding of customer needs is essential to growing businesses through great products and services. Farrah has honed her customer-centric insights as an advisor to some of the world’s most respected brands, including Apple, Microsoft, Disney, Samsung, and UPS. She began her career as a creative and then went on to be a strategist at leading agencies, including Wieden + Kennedy, TBWA\Chiat\Day, Mad Dogs & Englishmen, and Digitas, where she was group planning director and mobile strategy lead. Farrah also ran innovation as a partner at Hall & Partners and developed digital tools for online qualitative research as SVP of consumer immersion at OTX.

Presentations

How to ask good questions DDBD

We all like to say, "There's no such thing as a dumb question," and yet we—researchers, data scientists, and managers—also talk about asking the "right questions" when it comes to designing survey instruments, qualitative methods, databases, and other tools to help us make decisions. Farrah Bostic investigates where good questions come from and explains how to construct a good question.

Luc Bourlier has been working on the JVM since 2002, first for IBM on the Debugger team of the Eclipse project, where he wrote the expression evaluation engine. After a few other Eclipse projects, Luc went to TomTom to recreate their data distribution platform for over-the-air services. He then joined Lightbend to work on the Eclipse plugin for Scala before switching to the Fast Data team to focus on deployment and interaction of streaming systems.

Presentations

Reactive Streams: Linking reactive applications to Spark Streaming Session

Reactive Streams is an API designed to connect reactive systems with back-pressure. Luc Bourlier explains why, with Spark Streaming now supporting back-pressure, Reactive Streams is the right tool to connect Spark Streaming in a reactive application.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine-learning and distributed-systems experience. Previously, Claudiu worked for Atigeo Inc., building big data and data science-driven products for various customers.

Presentations

Semantic natural language understanding with Spark Streaming, UIMA, and machine-learned ontologies Session

David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.

Mikio Braun is delivery lead for recommendation and search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in the industry.

Presentations

Hardcore data science in practice HDS

Mikio Braun explains why, in practice, hardcore data science is not just about learning methods but also about bringing these methods to production. This does not mean simply reimplementing methods in production systems. Rather, you must successfully deal with issues like data updates, cultural differences between data science and developers, and how to monitor and test in practice.

Natalino Busa is the head of data science at Teradata, where he leads the definition, design, and implementation of big, fast data solutions for data-driven applications, such as predictive analytics, personalized marketing, and security event monitoring. Previously, Natalino served as enterprise data architect at ING and as senior researcher at Philips Research Laboratories on the topics of system-on-a-chip architectures, distributed computing, and parallelizing compilers. Natalino is an all-around technology manager, product developer, and innovator with a 15+ year track record in research, development, and management of distributed architectures and scalable services and applications.

Presentations

Sightseeing, venues, and friends: Predictive analytics with Spark ML and Cassandra Session

Which venues have similar visiting patterns? How can we detect when a user is on vacation? Can we predict which venues will be favorited by users by examining their friends' preferences? Natalino Busa explains how these predictive analytics tasks can be accomplished by using Spark SQL, Spark ML, and just a few lines of Scala code.

Neil Carden is a director of Aquila Insight, where he enables brands to systematically make better, evidence-led decisions for their customers. Neil has worked in financial services and retail across Europe and the US, making sure that the customer is at the heart of day-to-day and strategic decision making. His tool of choice is customer data (both big data and wee data), and his focus is on creating the people, process, and infrastructure engine that efficiently and effectively transforms customer data into change for customers at the shelf edge. Most recently, he has been developing the Co-operative Group’s insight operating model to simplify and speed up the deployment of actionable insight to business decision makers—putting the right nugget of customer insight in front of the right decision maker at the right time.

Presentations

The curious and complicated practice of transforming data into value: Why big data has become notorious Data 101

Big data may be past the hype curve, but it's become notorious in the process. Everyone's doing it, but no one's quite sure how you turn it into action and unlock its potential value. Drawing on real-life examples, Neil Carden explores the practice of transforming data into value by examining the factors that slow things down and explaining how they've been overcome.

Connor Carreras is Trifacta’s manager for customer success in the Americas, where she helps customers use cutting-edge data wrangling techniques in support of their big data initiatives. Connor brings her prior experience in the data integration space to help customers understand how to adopt self-service data preparation as part of an analytics process. She is a coauthor of the O’Reilly book Principles of Data Wrangling.

Presentations

Improving the customer experience with big data wrangling on Hadoop Session

Big data provides an unprecedented opportunity to really understand and engage with your customers, but only if you have the keys to unlock the value in the data. Through examples from the Royal Bank of Scotland, Dan Jermyn and Connor Carreras explain how to use data wrangling to harness the power of data stored on Hadoop and deliver personalized interactions to increase customer satisfaction.

Slava Chernyak is a senior software engineer at Google. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Google Cloud Dataflow Streaming from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.

Presentations

Ask me anything: Stream processing Ask Me Anything

Apache Beam/Google Cloud Dataflow engineers Tyler Akidau, Kenneth Knowles, and Slava Chernyak will be on hand to answer a wide range of detailed questions about stream processing. Even if you don’t have a specific question, join in to hear what others are asking.

Watermarks: Time and progress in streaming dataflow and beyond Session

Watermarks are a system for measuring progress and completeness in out-of-order stream processing systems and are used to emit correct results in a timely way. Given the trend toward out-of-order processing in current streaming systems, understanding watermarks is an increasingly important skill. Slava Chernyak explains watermarks and demonstrates how to apply them using real-world cases.

Wellington Chevreuil has more than 12 years’ experience in the IT industry, mainly in software development and support, and has been dealing with Hadoop and big data since 2011. Wellington currently works in Cloudera’s customer operations organization, helping companies from different sectors succeed with their big data deployments.

Presentations

Apache Hadoop operations for production systems Tutorial

Jayesh Seshadri, Justin Hancock, Mark Samson, and Wellington Chevreuil offer a full-day deep dive into all phases of successfully managing Hadoop clusters—from installation to configuration management, service monitoring, troubleshooting, and support integration—with an emphasis on production systems.

Ask me anything: Hadoop operations Ask Me Anything

Mark Samson, Jayesh Seshadri, Wellington Chevreuil, and James Kinley, the instructors of the full-day tutorial Apache Hadoop Operations for Production Systems, field a wide range of detailed questions about Hadoop, from debugging and tuning across different layers to tools and subsystems to keep your Hadoop clusters always up, running, and secure.

Prakash Chockalingam is currently a solutions architect at Databricks, where he focuses on helping customers build their big data infrastructure, drawing on his decade-long experience with large-scale distributed systems and machine-learning infrastructure at companies including Netflix and Yahoo. Prior to joining Databricks, Prakash was with Netflix, designing and building the recommendation infrastructure that serves out millions of recommendations to users every day. His interests broadly include distributed systems and machine learning. He coauthored several publications on machine learning and computer vision research in the early stages of his career.

Presentations

So you think you can stream: Use cases and design patterns for Spark Streaming Session

So you’ve successfully tackled big data. Now let Vida Ha and Prakash Chockalingam help you take it real time and conquer fast data. Vida and Prakash cover the most common use cases for streaming, important streaming design patterns, and the best practices for implementing them to achieve maximum throughput and performance of your system using Spark Streaming.

Jo-fai (Joe) Chow is a customer data scientist at H2O.ai. Joe liaises with customers to expand the use of H2O beyond the initial footprint. Before joining H2O, he was on the business intelligence team at Virgin Media, where he developed data products to enable quick and smart business decisions. Joe also worked (part-time) for Domino Data Lab as a data science evangelist, promoting products via blogging and giving talks at meetups.

Presentations

Introduction to generalized low-rank models and missing values Session

The generalized low-rank model is a new machine-learning approach for reconstructing missing values and identifying important features in heterogeneous data. Through a series of examples, Jo-fai Chow demonstrates how to fit low-rank models in a parallelized framework and how to use these models to make better predictions.

Cliff Click is the CTO and cofounder of 0xdata, a firm dedicated to creating a new way to think about web-scale math and real-time analytics. Cliff wrote his first compiler when he was 15 (Pascal to TRS Z-80), although his most famous compiler is the HotSpot Server Compiler (the Sea of Nodes IR). He helped Azul Systems build an 864-core pure-Java mainframe that keeps GC pauses on 500 GB heaps to under 10 ms and worked on all aspects of that JVM. Previously, Cliff worked on HotSpot at Sun Microsystems. He is at least partially responsible for bringing Java into the mainstream. He is the author of about 15 patents, has published many papers about HotSpot technology, and is regularly invited to speak at industry and academic conferences. Cliff holds a PhD in computer science from Rice University.

Presentations

The innards of H2O Session

H2O is an in-memory, big-data, big-math machine-learning platform. Cliff Click offers a technical talk focused on the insides of H2O. Cliff explains how you can write simple, single-threaded Java code and have H2O autoparallelize and auto-scale-out to hundreds of nodes and thousands of cores.

Ira Cohen is a cofounder and chief data scientist of Anodot, where he is responsible for developing and inventing its real-time multivariate anomaly detection algorithms, which work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Analytics for large-scale time series and event data HDS

Time series and event data form the basis for real-time insights about the performance of businesses such as ecommerce, the IoT, and web services, but gaining these insights involves designing a learning system that scales to millions and billions of data streams. Ira Cohen outlines such a system that performs real-time machine learning and analytics on streams at massive scale.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata + Hadoop World conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Data-driven business day opening remarks DDBD

Strata + Hadoop World London program chair Alistair Croll opens Data-Driven Business Day.

DDBD closing remarks DDBD

Program chair Alistair Croll closes Data-Driven Business Day.

Friday keynote welcome Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Thursday keynote welcome Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Kenneth Cukier is a senior editor overseeing data and digital products at the Economist. Previously, he was the paper’s Tokyo correspondent and before that, its technology correspondent in London. Kenneth is the coauthor of the award-winning book Big Data: A Revolution that Transforms How We Work, Live and Think, a New York Times bestseller that has been translated into over 20 languages. He is also a regular commentator in the media on business and technology and a frequent public speaker on trends in big data.

Presentations

How AI revolutionizes business strategy

For centuries, the decisions made in a company were the responsibility of the top managers. But when firms harness AI and big data, algorithms can make millions more decisions in the same time, and probably better ones. Kenneth Cukier explores how this affects the ways that companies are organized and how they compete and set strategy (as opposed to just execution).

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Friday keynote welcome Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

The next 10 years of Apache Hadoop Session

Ben Lorica hosts a conversation with Hadoop cofounder Doug Cutting and Tom White, an early user and committer of Apache Hadoop.

Thursday keynote welcome Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Frank Cuypers is a world leader in place marketing and place making. Currently, Frank is a strategist for the Canadian company Destination Think! where he offers his deep insights into city making and city marketing. In addition to being a speaker at international conferences in North America, Europe, and Australia, a marketing professor, and an advisor in policy and place making for a variety of European cities, Frank has also been a business journalist, festival programmer, and manager of business intelligence and marketing research. He is also a keen traveller and insists on returning to Berlin each year because it’s the “ever-unfinished city.” Frank recently started a new project about urban DNA called Why Your City?.

Presentations

A new way of measuring the economic impact of social media, or the trouble with Tribbles DDBD

Frank Cuypers offers a case study about the research and metrics that were constructed as the standard for the tourism industry to monetize the value of online marketing actions. This is a story about data and tactics, tourism and politics, and Star Trek.

Alex Dalyac is cofounder and CEO of Tractable. Previously, he worked as a quantitative researcher at London hedge fund Toscafund. Alex holds an MS in computer science from Imperial College London and was a recipient of the Philips Prize in computer science.

Presentations

Addressing the labeling bottleneck in deep learning for computer vision HDS

The bottleneck in computer vision is in creating sufficiently large, labeled training sets for tasks. Alexandre Dalyac and Robert Hogan address this issue through a combination of dimensionality reduction, information retrieval, and domain adaptation techniques packaged in a software product that acts as a human-algorithm interface to facilitate transfer of expertise from human to machine.

In her research career, Roxana Danger has often pursued and achieved the dual goal of improving the performance of information extraction systems while proposing and validating novel mechanisms for storing and analyzing the extracted data in semantic knowledge databases. Roxana is currently working as a data scientist at ReedOnline LTD, designing and applying machine learning and NLP techniques for providing data-driven insights to the company. She was previously a research associate at the Computing Department of Imperial College London, where she designed and implemented a provenance platform and data mining tools for diagnosis decision support in healthcare systems as part of EU-FP7 project TRANSFoRm, and at the Department of Computer Systems and Computation at Universidad Politécnica de Valencia, Spain, where she worked on the development of an information extraction system for protein-protein interactions. Roxana holds a PhD from University Jaume I, Castellon, Spain, where her project aimed at extracting and analyzing semantic data from archaeology site excavation reports, and undergraduate and master’s degrees in computer science from Universidad de Oriente, Santiago de Cuba.

Presentations

A methodology for taxonomy generation and maintenance from large collections of textual data HDS

One of the main challenges organizations face is the semantic categorization of textual data. Roxana Danger offers an overview of ROOT, the Reed Online Occupational Taxonomy, which was constructed to improve the quality of services at reed.co.uk, and discusses this semisupervised methodology for generating (and maintaining) taxonomies from large collections of textual data.

IT veteran and big data evangelist Sebastian Darrington has worked in the industry for over 20 years. At EMC, Seb is responsible for driving its big data analytics offering in EMEA. This involves helping many global, European-based customers realize the value and benefits of using big data to empower them to grow, innovate, and remain competitive. He works with leading IT manufacturers and system integrators right through to the latest technology startups. Prior to joining EMC, Sebastian held a variety of senior technical consulting roles, including head of the EMEA technical practice at Kashya, head of technical strategy at Source Consulting, and the EMEA technical practice head at Veritas. He can be found blogging on The Storage Chap as well as on Twitter, discussing latest developments, topics, and themes of interest to the wider IT industry, including storage, big data, and beyond. Outside of work, Seb is a keen runner, and when he’s not training for the latest half marathon, he takes to the skies flying radio-controlled aircraft. He is married with two young sons and lives in the UK.

Presentations

Developing a successful big data strategy Session

Many businesses have undertaken big data projects, but for every successful project, there are dozens that have failed or stagnated. Seb Darrington explores the reasons why such projects hit obstacles, typical challenges, and how to overcome them along your own big data journey.

Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and is currently employed at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research about data-center frameworks and networks with Scott Shenker and Ion Stoica.

Presentations

Spark 2.0: What’s next? Session

Spark 2.0 is a major milestone for the project. It achieves major advances in performance and introduces new initiatives to unify stream processing with Spark’s SQL engine. Tathagata Das explores these exciting new developments in Spark 2.0 as well as some other major initiatives that are coming in the future.

The future of streaming in Spark: Structured streaming Session

Tathagata Das explains how Spark 2.x develops the next evolution of Spark Streaming by extending DataFrames and Datasets in Spark to handle streaming data. Streaming Datasets provides a single programming abstraction for batch and streaming data and also brings support for event-time-based processing, out-of-order data, sessionization, and tight integration with nonstreaming data sources.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Don't build a data swamp: Hadoop governance case studies for financial services Session

Mark Donsky and Chang She explore canonical case studies that demonstrate how leading banking, healthcare, and pharmaceutical organizations are tackling Hadoop governance challenges head-on. You'll learn how to ensure data doesn't get lost, help users find and trust the data they need, and protect yourself against a data breach—all at Hadoop scale.

Jim Dowling is a senior researcher at SICS – Swedish ICT and an associate professor at KTH Stockholm. Jim is a researcher in the areas of distributed systems, machine learning, and large-scale computer systems. He has also worked as a senior consultant for MySQL AB. Jim is lead architect of Hadoop Open Platform as a Service, a more scalable and highly available distribution of Hadoop.

Presentations

HopsWorks: Multitenant Hadoop as a service Session

Currently, multitenancy in Hadoop is limited to organizations running separate Hadoop clusters, and the secure sharing of resources is achieved using virtualization or containers. Jim Dowling describes how HopsWorks enables organizations to securely share a single Hadoop cluster using projects and a new metadata layer that enables protection domains while still allowing data sharing.

Cat Drew is a hybrid policy maker and designer, with over 10 years’ experience of working in government, including in the Cabinet Office and No.10. She also has a post-graduate education in design, which enables her to seek out innovative, new practices—for example, speculative design and data visualization—and experiment with how they could work in government. Cat works within the Policy Lab, a small unit in the Cabinet Office which supports departments in trying out new, innovative techniques to design policy solutions to tricky social problems. This includes leading multidisciplinary teams of policy makers, ethnographers, and data scientists to work with service providers and users to research, codesign, and prototype new ideas. She also works for the UK government’s Data Science Partnership, which promotes the use of new computer techniques and data to create insight to improve policy making and make government more efficient. Her work focuses on creating an ethical framework to give data scientists and policy makers the confidence to maximize their use of these new tools and techniques. Previously Cat was head of police digitization and neighbourhood policing and worked in strategic roles in the Home Office, Cabinet Office, and No.10.

Presentations

Bringing big data and design to policy making Keynote

Cat Drew explains how the UK's Policy Lab and GDS data teams are bringing more of a data, digital, and design approach to policy making, showcasing some of the Policy Lab projects that have used ethnography and data science to create fresh insight to change the way we think about policy problems.

Ted Dunning has been involved with a number of startups—the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member for the Apache Zookeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data mining and passionate about open source, he is an active participant of Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

Anomaly detection in telecom with Spark Session

Telecom operators need to find operational anomalies in their networks very quickly. Spark plus a streaming architecture can solve these problems very nicely. Ted Dunning presents a practical architecture as well as some detailed algorithms for detecting anomalies in event streams. These algorithms are simple and quite general and can be applied across a wide variety of situations.

Office Hour with Ted Dunning (MapR Technologies) Office Hours

Ted will answer questions about anomaly detection, streaming data architectures (including Kafka and MapR Streams), processing frameworks such as Flink, Apex, Storm, and Spark Streaming, and just about anything else you may want to talk about.

Simon Elliston Ball is a solutions engineer at Hortonworks, where he helps clients do Hadoop. Simon is a certified Spark and Hadoop developer. Previously, he worked in the data-intensive worlds of hedge funds and financial trading, ERP, and ecommerce, as well as designing and running nationwide networks and websites. Over the course of those roles, he designed and built several organization-wide data and networking infrastructures, headed up research and development teams, and designed (and implemented) numerous digital products and high-traffic transactional websites. For a change of technical pace, Simon writes and produces screencasts on frontend web technologies and performance and is an avid Node.js programmer.

Presentations

High-performance data flow with a GUI—and guts Session

Apache NiFi has seen it all. (It worked for the NSA after all.) What it brings to the Hadoop ecosystem is a series of data flow and ingest patterns, a GUI, and a lot of security and record-level data provenance. Simon Elliston Ball offers an overview of Apache NiFi and explores its innovations around content and provenance repositories.

Office Hour with Simon Elliston Ball (Hortonworks) Office Hours

If you have IoT, streaming, data-in-motion, machine-learning, or Spark topics you want to discuss, talk to Simon.

Stephan Ewen is one of the originators and committers of the Apache Flink project and CTO at data Artisans, leading the development of large-scale data stream processing technology. Stephan coauthored the Stratosphere system and has worked on data processing technologies at IBM and Microsoft. Stephan holds a PhD from the Berlin University of Technology.

Presentations

Enabling new streaming applications with Apache Flink Session

Data stream processing is emerging as a new paradigm for the data infrastructure. Streaming promises to unify and simplify many existing applications while simultaneously enabling new applications on both real-time and historical data. Stephan Ewen and Kostas Tzoumas introduce the data streaming paradigm and show how to build a set of simple but representative applications using Apache Flink.

Office Hour with Stephan Ewen (data Artisans/Apache Flink) Office Hours

Curious about Flink? Stephan will answer questions about Apache Flink, data streaming applications, the patterns of continuous streaming applications, details about how to build such applications on top of Apache Flink, or how to integrate Flink with other systems and applications.

Moty Fania leads development and architecture in the advanced analytics group within Intel IT. With over 13 years of experience in analytics, data warehousing, and decision support solutions, Moty drives the overall technology and architectural roadmap for big data analytics in Intel IT. Moty is also the architect behind Intel’s IoT big data analytics platform. He holds a bachelor’s degree in computer science and economics and a master’s degree in business administration from Ben-Gurion University in Israel.

Presentations

Stream analytics in the enterprise: A look at Intel’s internal IoT implementation Session

Moty Fania shares Intel’s IT experience implementing an on-premises IoT platform for internal use cases. The platform was based on open source big data technologies and containers and was designed as a multitenant platform with built-in analytical capabilities. Moty highlights the key lessons learned from this journey and offers a thorough review of the platform’s architecture.

Ruhollah Farchtchi is chief technologist and vice president of Zoomdata Labs. Ruhollah has over 15 years’ experience in enterprise data management architecture and systems integration. Prior to Zoomdata, he held management positions at BearingPoint, Booz-Allen, and Unisys. Ruhollah holds an MS in information technology from George Mason University.

Presentations

Building real-time BI systems with HDFS and Kudu Session

Ruhollah Farchtchi explores best practices for building systems that support ad hoc queries over real-time data and offers an overview of Kudu, a new storage layer for Hadoop that is specifically designed for use cases that require fast analytics on rapidly changing data with a simultaneous combination of sequential and random reads and writes.

Sameer Farooqui is a client services engineer at Databricks, where he works with customers on Apache Spark deployments. Sameer works with the Hadoop ecosystem, Cassandra, Couchbase, and the general NoSQL domain. Prior to Databricks, he worked as a freelance big data consultant and trainer globally and taught big data courses. Before that, Sameer was a systems architect at Hortonworks, an emerging data platforms consultant at Accenture R&D, and an enterprise consultant for Symantec/Veritas (specializing in VCS, VVR, and SF-HA).

Presentations

Spark camp: Exploring Wikipedia with Spark Tutorial

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Sameer Farooqui explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms.

Taryn Fixel is the founder of ingredient1, a food-tech startup removing the chaos from food. Ingredient1 is the premier resource for people to discover food based on bioindividual diet needs, taste, and philosophy—and get the social recommendations that give shoppers the confidence in their food choices. Ingredient1 is creating an interconnected ecosystem for consumers, the food industry, and medicine: this is the foundation for knowing how food impacts human health. Launched in March 2015, ingredient1 was rated one of the top Health and Wellness Apps of 2015 and named one of three Big Ideas Changing the Way We Eat by Inc. magazine. Prior to ingredient1, Taryn was an investigative and documentary journalist for CNN and CBS. She traveled the nation and the globe reporting on events from Hurricane Katrina to the Arab Spring. Taryn has been recognized and honored for her work in broadcast journalism, earning Peabody awards in 2008, 2011, and 2012, and is also a New York Festival gold and silver medal winner for social issue documentaries and best investigative long-form report, respectively.

Presentations

What should I eat: The road map to better food and smarter nutrition science DDBD

Taryn Fixel investigates the bioindividuality of food choices and explains how flexible data structures are capturing real-time food behaviors that will transform our understanding of nutrition and human health.

Clara Fletcher is a senior manager and technical architect at Accenture. Clara comes from a broad background that includes econometric forecasting, complex event processing, infrastructure design, and enterprise data provisioning. She is actively involved with emerging big data technologies, industry interest groups, and volunteer education programs. She has won the National Service Trust award, served as a Hackbright mentor, and holds a patent in digital document verification technology. Clara also instructs the Accenture hands-on big data course and is the lead of the online NoSQL course development.

Presentations

Best practices and solutions to manage and govern a multinational big data platform Session

As companies seek to expand their global data footprint, many new challenges arise. How can data be shared across countries? How can a company go about managing all of the policies and regulations specific to each country that customers reside in? Clara Fletcher explores best practices and lessons learned in international data management and security.

Thomas French is the CTO of Sandtable, a data science company based in London. Before joining Sandtable, Thomas completed a doctorate in informatics at the University of Edinburgh.

Presentations

Doing data science to support strategic business decisions Data 101

Much of the business narrative around data science draws attention to the importance of prediction and predictive models. But in many business contexts, prediction alone is not enough. Thomas French and Nigel Shardlow explain that to support strategic decision making, data scientists must build models that can explain events in a meaningful way.

Ellen Friedman is a solutions consultant, scientist, and O’Reilly author currently writing about a variety of open source and big data topics. Ellen is a committer on the Apache Drill and Mahout projects. With a PhD in biochemistry and years of work writing on a variety of scientific and computing topics, she is an experienced communicator. Ellen is coauthor of Streaming Architecture, the Practical Machine Learning series from O’Reilly, Time Series Databases, and her newest title, Introduction to Apache Flink. She’s also coauthor of a book of magic-themed cartoons, A Rabbit Under the Hat. Ellen has been an invited speaker at Strata + Hadoop in London, Berlin Buzzwords, the University of Sheffield Methods Institute, and the Philly ETE conference and a keynote speaker for NoSQL Matters 2014 in Barcelona.

Presentations

Building better cross-team communication DDBD

Organizations need to build cultures comfortable with data-driven decisions, so it's increasingly important that groups with widely different knowledge bases—such as business decision makers, domain experts, and technical developers—exchange information and ideas effectively. Ellen Friedman outlines specific steps to cut through barriers and build strength in cross-team communication.

Michal Galas is a chief programmer at the UK PhD Centre in Financial Computing and Analytics and a research associate in financial computing in the Computer Science Department at UCL. Michal is an experienced IT technical leader, enterprise-size systems architect, and research group manager with 10 years of experience in financial computing and algorithmic trading. Michal specializes in R&D of large-scale systems, including service-oriented financial platforms, event-driven analytical engines, and HPC-based simulation environments. Michal holds a BSc in computer science with artificial intelligence and an MSc in artificial intelligence from City University London, as well as a PhD in financial computing from UCL.

Presentations

Experimental computational simulation environments for big data analytics in social sciences HDS

Experimental computational simulation environments are increasingly being developed by major financial institutions to model their analytic algorithms. Michal Galas introduces the key concepts underlying these environments, which rely on big data analytics to enable large-scale testing, optimization, and monitoring of algorithms running in virtual or real mode.

Tanya Gallagher is a veteran technical instructor with thousands of hours of classroom experience across a 20-year career. Tanya has spent the past two years at DataStax writing curriculum and leading the curriculum development team. Prior to DataStax, she was a curriculum developer and technical instructor for Ixia, a networking test equipment manufacturer. She holds an MA in instructional design.

Presentations

Apache Cassandra: Get trained and get better paid Training

O’Reilly Media and DataStax have partnered to create a 2-day developer course for Apache Cassandra. Get trained as a Cassandra developer at Strata + Hadoop World in London, be recognized for your NoSQL expertise, and benefit from the skyrocketing demand for Cassandra developers.

Apache Cassandra: Get trained and get better paid (Day 2) Training Day 2

O’Reilly Media and DataStax have partnered to create a 2-day developer course for Apache Cassandra. Get trained as a Cassandra developer at Strata + Hadoop World in London, be recognized for your NoSQL expertise, and benefit from the skyrocketing demand for Cassandra developers.

Richard Gascoigne is a founding director and CTO for DTP Solutionpath, an analytics software company spun out of HPE Platinum reseller DTP. Richard started his career in the British Army, serving across the world until leaving to start his IT career in banking software. Richard was at the vanguard of software-as-a-service solutions and has used data and information for decision making across his career as well as developing business-specific solutions as a creative leader.

Presentations

Empowering the data-driven organization Session

As we strive to realize big data's value, many seek more agile and capable analytic systems that ensure end-to-end security. Chris Selland and Richard Gascoigne explore Hewlett Packard Enterprise's robust yet flexible offering that scales with evolving needs, covering HPE's big data reference architecture, Vertica SQL on Hadoop, and machine learning as a service.

Laurent Gaubert is the director of market intelligence and customer analytics at Autodesk. Previously, Laurent worked for Hewlett-Packard in France, holding various positions in product marketing, field marketing, channel partner management, and strategic alliances for their PC and printer business, as well as the then-small startup Research in Motion…but unwisely left before their IPO. (He still regrets this decision.) Outside of work, besides spending time with his family, Laurent is an avid skier and windsurfer. He’s trying to get into kite surfing following the advice of some of his staff though he’s still in the learning phase.

Presentations

Data science as catalyst of Autodesk's business model transformation DDBD

Autodesk's transition to a subscription business model has caused the company to rethink how it interacts with and engages its customers. Laurent Gaubert details how, in a short period of time, Autodesk has executed numerous data science projects that have enhanced its capabilities to acquire, retain, and provide more value to its customers.

Gabrielle Gianelli has served in both engineering and product management roles, giving her unique experience in using data to drive decision making, building large-scale systems, and designing data tools to meet organizational needs. Currently, Gabrielle is a senior program manager at Etsy for the Data Engineering team, where she has been leading the feature roadmap for Etsy’s internally built A/B testing tool and working on data reliability across key company metrics. She graduated from Princeton University with a liberal arts degree and minor in computer science.

Presentations

Intuit, Uber, and Etsy: Scaling innovation with A/B testing Session

A data-driven culture empowers companies to deliver greater value to their customers, yet many organizations still struggle to break down cultural barriers and drive data-driven innovation across their products. Lucian Lita, Mita Mahadevan, Shalin Mantri, and Gabrielle Gianelli explore Intuit's, Uber's, and Etsy's A/B platforms, which enable experimentation and engender a data-driven mentality.

Rex Gibson is head of data warehousing at Knewton, the world’s leading adaptive learning platform. Knewton’s mission, to bring personalized education to the world, is built on its data. Rex wrote his first code on an Apple IIe in 1986. In 1996, he wrote his first SQL statement at Webster University while soldering circuits and transcribing Charlie Parker solos. Since then Rex has made his mark developing tools that make businesses more efficient. Rex has built data warehouses for a wide variety of industries including finance, construction, arts/entertainment, human resources, government, retail, and edtech. He has defended the United Nation’s Mission websites from hackers, built an email marketing platform for the Metropolitan Opera, and managed 24×7 systems teams. Rex is profoundly grateful to all of the talented people he has learned from along the way. His most recent teacher is two years old and loves dinosaurs and ukulele music.

Presentations

Big SQL: The future of in-cluster analytics and enterprise adoption Session

Hear why big SQL is the future of analytics. Experts at Yahoo, Knewton, FullStack Analytics, and Looker discuss their respective data architectures, the trials and tribulations of running analytics in-cluster, and examples of the real business value gained from putting their data in the hands of employees across their companies.

Charles Givre is an unapologetic data geek who is passionate about helping others learn about data science and become passionate about it themselves. For the last five years, Charles has worked as a data scientist at Booz Allen Hamilton for various government clients and has done some really neat data science work along the way, hopefully saving US taxpayers some money. Most of his work has been in developing meaningful metrics to assess how well the workforce is performing. For the last two years, Charles has been part of the management team for one of Booze Allen Hamilton’s largest analytic contracts, where he was tasked with increasing the amount of data science on the contract—both in terms of tasks and people.

Even more than the data science work, Charles loves learning about and teaching new technologies and techniques. He has been instrumental in bringing Python scripting to both his government clients and the analytic workforce and has developed a 40-hour Introduction to Analytic Scripting class for that purpose. Additionally, Charles has developed a 60-hour Fundamentals of Data Science class, which he has taught to Booz Allen staff, government civilians, and US military personnel around the world. Charles has a master’s degree from Brandeis University, two bachelor’s degrees from the University of Arizona, and various IT security certifications. In his nonexistent spare time, he plays trombone, spends time with his family, and works on restoring British sports cars.

Presentations

What does your smart car know about you? Session

In the last few years, auto makers and others have introduced devices to connect cars to the Internet and gather data about the vehicles’ activity, and auto insurers and local governments are just starting to require these devices. Charles Givre gives an overview of the security risks as well as the potential privacy invasions associated with this unique type of data collection.

Joe Goldberg is the lead solutions marketing manager at BMC Software, where he helps BMC products leverage new technology to deliver market-leading solutions with a focus on workload automation and big data. Joe has more than 35 years of experience in the design, development, implementation, sales, and marketing of enterprise solutions to Global 2000 organizations.

Presentations

Operating batch in the data-driven enterprise Session

Joe Goldberg discusses the attributes required of a batch management platform that can accelerate development by enabling programmers to generate workflows as code, support continuous deployment with rich APIs and lightweight workflow scheduling infrastructure, and optimize production with comprehensive enterprise operational capabilities like SLA management and full log and output management.

Gopal GopalKrishnan is a solution architect in the Partners & Strategic Alliances group at OSIsoft. Gopal has been working with OSIsoft’s PI System since the mid-1990s in software development, technical and sales support, and field services. Previously, he was a product manager with a focus on enterprise and asset integration and PI data access. Gopal is a registered professional engineer in Pennsylvania. He is a member of the MESA Technical Committee, the Education Committee, and the MESA Continuous Process Industry Special Interest Group. He actively participates in topics such as big data, data mining, energy efficiency, manufacturing intelligence, and sustainability (including green initiatives in facilities and data centers). Gopal holds a master’s degree in engineering and continuing education in business administration.

Presentations

Industrial big data and sensor time series data: Different but not difficult Session

For decades, industrial manufacturing has dealt with large volumes of sensor data and handled a variety of data from the various manufacturing operations management (MOM) systems in production, quality, maintenance, and inventory. Gopal GopalKrishnan and Hoa Tram offer lessons learned from applying big data ecosystem tools to oil and gas, energy, utilities, metals, and mining use cases.

Dirk Gorissen is the head of R&D at Skycap and a consultant for the World Bank. Dirk has worked in research labs in the US, Canada, Europe, and Africa and with a wide range of industrial partners, including Rolls-Royce, BMW, ArcelorMittal, NXP, and Airbus. His interests span machine learning, data science, and computational engineering, particularly in the unmanned systems domain. Dirk holds master’s degrees in computer science and artificial intelligence and a PhD in computational engineering. After eight years in academia, he joined BAE Systems Research, where he worked on big data analysis, deep learning, integrated vehicle health management, and autonomous systems before going into business for himself, dividing his time between the World Bank (where he advises the Tanzanian government), UAV startup Skycap (where he leads the R&D activities), and a number of other startups. Dirk is an active STEM Ambassador, an organizer of the London Big-O Algorithms & Machine Learning meetups, and active in the Tech4Good/ICT4D space.

Presentations

Land mine or Coke can: Machine learning from GPR data Session

Dirk Gorissen demonstrates how to use machine learning to detect land mines from a drone-mounted ground-penetrating radar sensor.

As training lead at Mango, Aimee Gott has delivered over 200 days of training, including onsite training courses in Europe and the US in all aspects of R as well as shorter workshops and online webinars. Aimee oversees Mango’s training course development across the data science pipeline and regularly attends R user groups and meetups. Aimee is also a coauthor of Sams Teach Yourself R in 24 Hours. Aimee holds a PhD in statistics from Lancaster University.

Presentations

R and reproducible reporting for big data Tutorial

Aimee Gott, Mark Sellors, and Richard Pugh explore techniques for optimizing your workflow in R when working with big data, including how to efficiently extract data from a database, techniques for visualization and analysis, and how all of this can be incorporated into a single, reproducible report, directly from R.

Tugdual Grall is an open source advocate, a passionate developer, and a chief technical evangelist EMEA at MapR, where he works to ease MapR, Hadoop, and NoSQL adoption within European developer communities. Before joining MapR, Tug was a technical evangelist at MongoDB and Couchbase. Tug has also worked as CTO at eXo Platform and JavaEE product manager and software engineer at Oracle. Tugdual is a cofounder of the Nantes JUG (Java user group) and also writes a blog.

Presentations

High-frequency decisioning, from big data to fast data Session

Stream-based technologies allow big data applications to deal with low-latency decisions and provide a more agile way to develop and deploy applications. Tugdual Grall details the various elements of a stream-based application and outlines the key capabilities of modern messaging layers like Apache Kafka and MapR Streams.

Olivier Grisel is a software engineer at Inria Saclay, France, where he works on scikit-learn, an open source project for machine learning in Python. Olivier also contributes occasional bug fixes to upstream projects in the NumPy/SciPy ecosystem.

Presentations

Recent advances in deep learning research HDS

Deep learning leverages compositions of parametrized differentiable modules commonly referred to as neural networks to build versatile and powerful predictive models from richly annotated data. Olivier Grisel offers an overview of recent trends and advances in deep learning research in computer vision, natural language understanding, and agent control via reinforcement learning.

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry and has contributed to a number of open source projects including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data at national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Jonathan Seidman, Ted Malaska, and Gwen Shapira, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Hadoop application architectures: Fraud detection Tutorial

Jonathan Seidman, Mark Grover, Gwen Shapira, and Ted Malaska walk attendees through an end-to-end case study of building a fraud detection system, providing a concrete example of how to architect and implement real-time systems.

Kanu Gulati is a senior associate at Zetta Venture Partners. Kanu has over 10 years of operating experience as an engineer, scientist, and strategist. She owned Intel’s multicore CAD algorithms research roadmap, developed advanced parallel CAD solutions, and pioneered metrics-driven methodology improvements for design flows. In addition, Kanu led due diligence and provided deal support for early-stage investments at Intel Capital and Khosla Ventures. Kanu was the first business hire at MapD (hardware-accelerated data visualization) and held engineering roles at Nascentric (fast-circuit simulation tool, acquired by Cadence) and Atrenta (predictive analytics for design verification and optimization, acquired by Synopsys), among others.

Kanu has coauthored 3 books, a book chapter, 35+ peer-reviewed publications with 370+ citations, and 1 US patent on high-performance computing and hardware acceleration. She has a PhD and master’s degree from Texas A&M University and an undergraduate degree from Delhi College of Engineering. Kanu completed her MBA at Harvard Business School, where she was copresident of the annual Venture Capital and Private Equity Conference.

Presentations

Opportunities for hardware acceleration in big data analytics Session

Hardware-accelerated solutions are ready to meet challenges in data collection, exploration, and visualization. Simply stated, data analytics and high-performance computing must evolve hand in hand. Kanu Gulati provides an overview of the advances in hardware acceleration and discusses specific real-world use cases of HPC applications that are enabling innovation in analytics.

Chaitali Gupta is a senior software engineer on the Hadoop Platform team at eBay. Chaitali holds a PhD in computer science from SUNY Binghamton, where she worked as a research assistant in the university’s Grid Computing Research Laboratory. Her interests included query, semantic reasoning, and management of scientific metadata and web services in large-scale grid computing environments.

Presentations

Apache Eagle: Secure Hadoop in real time Session

Apache Eagle is an open source monitoring solution to instantly identify access to sensitive data, recognize malicious activities, and take action. Arun Karthick Manoharan, Edward Zhang, and Chaitali Gupta explain how Eagle helps secure a Hadoop cluster using policy-based and machine-learning user-profile-based detection and alerting.

Tal Guttman is director of the Core Platform at Windward, a maritime data and analytics company. In this role, Tal leads a team building the world’s first maritime data platform, which is analyzing and organizing the world’s maritime data and making it accessible and actionable across the ecosystem, from flagging criminal threats at sea to combatting illegal fishing to identifying new market trading opportunities. Prior to joining Windward, Tal was head of strategic innovation for the IDF’s Elite Intelligence Corps unit, 8200. Tal holds a bachelor’s degree in history from Tel Aviv University. He lives in Tel Aviv with his wife and two sons.

Presentations

90% of the world's trade is transported by sea, but what data do we have about ship activity worldwide? Session

With over 90% of the world’s trade transported over the oceans, data on ship activity is critical to decision makers across industries. But despite the huge stakes at sea, ship activity remains a mystery: the data is massive, fragmented, and extremely unreliable when taken as is. Tal Guttman explores how data science can shed light on this critically important but opaque world.

Vida Ha is currently a solutions engineer at Databricks. Previously, she worked on scaling Square’s reporting analytics system. Vida first began working with distributed computing at Google, where she improved search rankings of mobile-specific web content and built and tuned language models for speech recognition using a year’s worth of Google search queries. She’s passionate about accelerating the adoption of Apache Spark to bring the combination of speed and scale of data processing to the mainstream.

Presentations

So you think you can stream: Use cases and design patterns for Spark Streaming Session

So you’ve successfully tackled big data. Now let Vida Ha and Prakash Chockalingam help you take it real time and conquer fast data. Vida and Prakash cover the most common use cases for streaming, important streaming design patterns, and the best practices for implementing them to achieve maximum throughput and performance of your system using Spark Streaming.

Justin Hancock has worked in the technology industry for 20 years, including the last 6 with Hadoop and related technologies as a developer, architect, support engineer, and manager. Justin currently works at Cloudera as a technical customer success manager, assisting Cloudera’s largest customers in EMEA with their Hadoop deployments across financial services, pharmaceuticals, government, and telecommunications.

Presentations

Apache Hadoop operations for production systems Tutorial

Jayesh Seshadri, Justin Hancock, Mark Samson, and Wellington Chevreuil offer a full-day deep dive into all phases of successfully managing Hadoop clusters—from installation to configuration management, service monitoring, troubleshooting, and support integration—with an emphasis on production systems.

Alan Hannaway is the product owner for data at 7digital, where he is responsible for ensuring the company is developing and extracting value from its line of data products. Prior to 7digital, Alan worked in a variety of roles, most recently providing data to the entertainment industry through his own startup. Alan started his career working as a researcher in computer science, focusing his interests on the application of technology to measure the scale and distribution of content consumption on large Internet networks.

Presentations

What’s next for music services? The answer is in the data Session

Can our real-time distributed data systems help predict whether high-resolution audio is the future of digital music? What about content curation? Paul Shannon and Alan Hannaway explore the future of music services through data and explain why 7digital believes well-curated, high-resolution listening experiences are the future of digital music services.

Phil Harvey is CTO and a founding member of DataShaka. He’s also a big beardy geek.

Presentations

20 percent blissful, 80 percent ignorance Session

Data is all sales and marketing. The reality of data work is pain. Most data projects fail and are horrible experiences to work on. Phil Harvey explains that data is just too hard—the world needs to talk about real challenges so that we can start tackling them to deliver data projects that work. This is DataOps; there will be tears before bedtime.

Joseph M. Hellerstein is the Jim Gray Chair of Computer Science at UC Berkeley and cofounder and CSO at Trifacta. Joe’s work focuses on data-centric systems and the way they drive computing. He is an ACM fellow, an Alfred P. Sloan fellow, and the recipient of three ACM-SIGMOD Test of Time awards for his research. He has been listed by Fortune among the 50 smartest people in technology, and MIT Technology Review included his work on their TR10 list of the 10 technologies most likely to change our world.

Presentations

Data relativism and the rise of context services Keynote

The traditional data warehouse of the 1990s was quaintly called the “single source of truth.” Joe Hellerstein explains why today we take a far more relativistic view: the meaning of data depends on the context in which it is used.

Brian Hills is the head of data at The Data Lab, where he is responsible for developing and leading the strategy to create a national hub for data science in Scotland. Brian has 18 years of industry experience in engineering and analytics across a number of roles and domains including telecoms, IT, and ecommerce. Before The Data Lab, he was most recently with Skyscanner, where he launched and scaled the analytics team.

Presentations

Experiments in The Data Lab: Creating a national hub for data science in Scotland Session

The Data Lab is an innovation center that delivers social and economic benefit to Scotland by bringing industry, the public sector, and academia together to exploit new opportunities from data. Brian Hills shares insights and lessons learned during the center's first 18 months, organized into three themes: collaborative innovation, nurturing skills and talent, and community building.

Mads Hjorth is a datalogist working within the central public administration in Denmark. For the last decade, he has contributed to the national digitization project as an IT architect, data architect, programmer, teacher, and end user of public IT systems. Lately, Mads has been involved in the formulation of e-government strategies and frameworks on both the national and European levels.

Presentations

Denmark is data driven Session

Mads Hjorth offers a glimpse of a world-class digital public administration, showcasing how data has transformed the Danish public administration and its services toward citizens and businesses, and issues a call for cross-border collaboration to effectively address central challenges using modern data technologies.

Robert Hogan is a data scientist at Tractable. Robert holds a PhD in theoretical particle physics and cosmology from King’s College London and an MSc in quantum fields and fundamental forces from Imperial College London and has been the recipient of numerous awards, including the Institute of Physics Earnshaw Medal 2011.

Presentations

Addressing the labeling bottleneck in deep learning for computer vision HDS

The bottleneck in computer vision is in creating sufficiently large, labeled training sets for tasks. Alexandre Dalyac and Robert Hogan address this issue through a combination of dimensionality reduction, information retrieval, and domain adaptation techniques packaged in a software product that acts as a human-algorithm interface to facilitate transfer of expertise from human to machine.

Jeff Holoman is a systems engineer at Cloudera. Jeff is a Kafka contributor and has focused on helping customers with large-scale Hadoop deployments, primarily in financial services. Prior to his time at Cloudera, Jeff worked as an application developer, system administrator, and Oracle technology specialist.

Presentations

When it absolutely, positively has to be there: Reliability guarantees in Kafka Session

Kafka provides the low latency, high throughput, high availability, and scale that financial services firms require. But can it also provide complete reliability? Gwen Shapira and Jeff Holoman explain how developers and operation teams can work together to build a bulletproof data pipeline with Kafka and pinpoint all the places where data can be lost if you're not careful.
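The reliability pattern this kind of session typically builds on—retrying sends until acknowledged (at-least-once delivery) and deduplicating downstream—can be sketched in plain Python. This is an illustrative toy, not the Kafka API; all class and function names here are invented for the sketch:

```python
# Toy sketch of at-least-once delivery plus consumer-side deduplication.
# Real Kafka pipelines get these guarantees from acks, retries, and
# replication settings; this just illustrates the idea.

class Producer:
    def __init__(self, broker):
        self.broker = broker
        self.seq = 0

    def send(self, payload):
        msg = (self.seq, payload)
        self.seq += 1
        # Retry until the broker acknowledges: a message may be written
        # more than once, but never zero times (at-least-once).
        while not self.broker.append(msg):
            pass

class FlakyBroker:
    """Writes always succeed, but every other ack is 'lost'."""
    def __init__(self):
        self.log = []
        self._flip = False

    def append(self, msg):
        self.log.append(msg)      # the write itself always lands...
        self._flip = not self._flip
        return self._flip         # ...but the ack sometimes goes missing

def consume(log):
    """Deduplicate on sequence number: effectively-once processing."""
    seen, out = set(), []
    for seq, payload in log:
        if seq not in seen:
            seen.add(seq)
            out.append(payload)
    return out

broker = FlakyBroker()
producer = Producer(broker)
for word in ["a", "b", "c"]:
    producer.send(word)

assert len(broker.log) > 3                    # duplicates from retries
assert consume(broker.log) == ["a", "b", "c"]  # but no loss, no dupes downstream
```

The point of the sketch is the division of labor the session describes: producers and brokers guarantee nothing is lost, while consumers (or idempotent sinks) make duplicates harmless.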


Presentations

Demonstrating the art of the possible with Spark and Hadoop Session

Apache Spark is on fire. Over the past five years, more and more organizations have looked to leverage Spark to operationalize their teams and the delivery of analytics to their respective businesses. Adrian Houselander and Joy Spohn demonstrate two use cases of how Apache Spark and Apache Hadoop are being used to harness valuable insights from complex data across cloud and hybrid environments.

Jeroen Janssens is the founder of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. One day a week, Jeroen is also an assistant professor at Jheronimus Academy of Data Science. Previously, he was a data scientist at Elsevier in Amsterdam and at the startups YPlan and Outbrain in New York City. He is the author of Data Science at the Command Line, published by O’Reilly. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Presentations

The polyglot data scientist Session

A polyglot is a person who knows and is able to use several languages. There are a plethora of programming languages and computing environments available for working with data, and some data science projects require using multiple languages together. Jeroen Janssens discusses three approaches to becoming a polyglot data scientist.

Dan Jermyn joined the Royal Bank of Scotland in 2012. After stints in digital analytics and optimization, Dan is now driving new value streams for the bank and its customers as head of big data. A poacher-turned-gamekeeper, Dan learned his trade as a consultant and then head of analytics for an agency, where he led engagements with a host of major corporate and governmental organizations. In addition, he is a pioneer in digital marketing technology, having cofounded the SiteTagger platform, acquired by BrightTag (now Signal) in 2012.

Presentations

Improving the customer experience with big data wrangling on Hadoop Session

Big data provides an unprecedented opportunity to really understand and engage with your customers, but only if you have the keys to unlock the value in the data. Through examples from the Royal Bank of Scotland, Dan Jermyn and Connor Carreras explain how to use data wrangling to harness the power of data stored on Hadoop and deliver personalized interactions to increase customer satisfaction.

Flavio Junqueira leads the Pravega team at DellEMC. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, he held an engineering position with Confluent and research positions with Yahoo Research and Microsoft Research. Flavio is an active contributor of Apache projects, such as Apache ZooKeeper (PMC and committer), Apache BookKeeper (PMC and committer), and Apache Kafka, and he coauthored the O’Reilly ZooKeeper book. Flavio holds a PhD degree in computer science from the University of California, San Diego.

Presentations

Ask me anything: Apache Kafka Ask Me Anything

Ian Wrigley, Neha Narkhede, and Flavio Junqueira field a wide range of detailed questions about Apache Kafka. Even if you don’t have a specific question, join in to hear what others are asking.

Making sense of exactly-once semantics Session

Exactly-once semantics is a highly desirable property for streaming analytics. Ideally, all applications process events once and never twice, but making such guarantees in general either induces significant overhead or introduces other inconveniences, such as stalling. Flavio Junqueira explores what's possible and reasonable for streaming analytics to achieve when targeting exactly-once semantics.

Chris Kammermann grew up on the edge of the Australian outback with a pet kangaroo and a talent for sheep shearing (current record: 15 in one day). Chris ventured over to the UK on a backpack visa in 2008 and has been working in British IT ever since. The team lead for service engineering infrastructure at Shazam since 2014, Chris describes himself as a jack of all trades and an all-around nice guy.

Presentations

The datafication and “datafuncation” of our business

Chris Kammermann explores how big data is guiding the future of one of the world’s most popular apps, Shazam, focusing on how the company is empowering employees to have fun with big data through tools like Splunk—in turn helping to create new products and revenue streams, as well as the term “datafuncation.”

Holden Karau is a transgender Canadian Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo.

Presentations

Beyond shuffling: Tips and tricks for scaling Spark jobs Session

Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production, covering topics like when to use DataFrames, tuning, and working with key skew.
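One skew-mitigation trick talks like this commonly cover—“salting” a hot key so its work spreads across partitions, then merging the partial results—can be illustrated without Spark itself. The following is a plain-Python stand-in; the constants and helper names are invented for the sketch, not taken from the session:

```python
import random
from collections import Counter, defaultdict

# Key salting sketch: a single hot key would overwhelm one partition,
# so we append a random salt, aggregate per salted key (parallelizable),
# then fold the partials back onto the original keys.

random.seed(0)   # deterministic for the example
NUM_SALTS = 4    # illustrative fan-out factor

def salt(key):
    return f"{key}#{random.randrange(NUM_SALTS)}"

def unsalt(salted_key):
    return salted_key.rsplit("#", 1)[0]

# Heavily skewed input: "hot" dominates.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Stage 1: partial aggregation on salted keys.
partial = Counter()
for key, value in records:
    partial[salt(key)] += value

# Stage 2: merge partials back onto the original keys.
totals = defaultdict(int)
for salted_key, value in partial.items():
    totals[unsalt(salted_key)] += value

assert totals == {"hot": 1000, "cold": 10}
assert sum(k.startswith("hot#") for k in partial) > 1  # hot key was spread out
```

In Spark the same two-stage shape appears as a salted `reduceByKey` followed by a second aggregation on the unsalted key, trading a little extra shuffle for balanced partitions.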

James Kinley is a principal solutions architect at Cloudera and has been involved with Hadoop since early 2010. James joined Cloudera from the UK defense industry, where he specialized in cybersecurity. James now works with Cloudera’s customers across EMEA to help them succeed in their Hadoop endeavors.

Presentations

Ask me anything: Hadoop operations Ask Me Anything

Mark Samson, Jayesh Seshadri, Wellington Chevreuil, and James Kinley, the instructors of the full-day tutorial Apache Hadoop Operations for Production Systems, field a wide range of detailed questions about Hadoop, from debugging and tuning across different layers to tools and subsystems to keep your Hadoop clusters always up, running, and secure.

Maciej Klimek is a senior data scientist at deepsense.io, where he uses his machine-learning expertise to solve the real-world problems faced by the company’s clients. He has taken part in several Kaggle competitions, recently winning the Right Whale Recognition challenge sponsored by the National Oceanic and Atmospheric Administration (NOAA). During his studies at the University of Warsaw, he took part in numerous programming competitions such as Topcoder and ACM.

Presentations

Which whale is it anyway? Face recognition for right whales using deep learning Session

With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. To interest the data science community, NOAA Fisheries organized a competition hosted on Kaggle.com. Robert Bogucki and Maciej Klimek outline the winning solution.

Kenn Knowles is a founding committer of Apache Beam (incubating). Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.

Presentations

Ask me anything: Stream processing Ask Me Anything

Apache Beam/Google Cloud Dataflow engineers Tyler Akidau, Kenneth Knowles, and Slava Chernyak will be on hand to answer a wide range of detailed questions about stream processing. Even if you don’t have a specific question, join in to hear what others are asking.

Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines Session

Drawing on important real-world use cases, Kenneth Knowles delves into the details of the language- and runner-independent semantics developed for triggers in Apache Beam, demonstrating how the semantics support the use cases as well as all of the above variability in streaming systems. Kenneth then describes some of the particular implementations of those semantics in Google Cloud Dataflow.

Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.

Presentations

Data modeling for data science: Simplify your workload with complex types Session

Marcel Kornacker explains how nested data structures can increase analytic productivity, using the well-known TPC-H schema to demonstrate how to simplify analytic workloads with nested schemas.
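The idea can be illustrated outside Impala with a toy example: in a flat relational layout, a parent-child relationship must be reassembled with a join, while in a nested layout the children live inside the parent record and the join disappears. The TPC-H-inspired field names below are my own stand-ins (the session itself works with the real TPC-H schema in SQL):

```python
# Flat layout: orders and line items in separate tables, related by key.
orders = [{"o_orderkey": 1, "o_custkey": 100},
          {"o_orderkey": 2, "o_custkey": 101}]
lineitems = [{"l_orderkey": 1, "l_price": 10.0},
             {"l_orderkey": 1, "l_price": 5.0},
             {"l_orderkey": 2, "l_price": 7.5}]

# "Revenue per order" requires an explicit join back to the parent.
flat_revenue = {
    o["o_orderkey"]: sum(li["l_price"] for li in lineitems
                         if li["l_orderkey"] == o["o_orderkey"])
    for o in orders
}

# Nested layout: line items are embedded in their order, so the
# parent-child relationship is implicit and no join is needed.
nested_orders = [
    {"o_orderkey": 1, "o_custkey": 100,
     "lineitems": [{"l_price": 10.0}, {"l_price": 5.0}]},
    {"o_orderkey": 2, "o_custkey": 101,
     "lineitems": [{"l_price": 7.5}]},
]

nested_revenue = {
    o["o_orderkey"]: sum(li["l_price"] for li in o["lineitems"])
    for o in nested_orders
}

assert flat_revenue == nested_revenue == {1: 15.0, 2: 7.5}
```

The same simplification carries over to analytic SQL with complex types: correlated joins over child tables become straightforward expressions over nested collections.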

Anirudh Koul is a senior data scientist at Microsoft AI and Research. An entrepreneur at heart, he has been running a mini-startup team within Microsoft, prototyping ideas using computer vision and deep learning techniques for augmented reality, productivity, and accessibility, building tools for communities with visual, hearing, and mobility impairments. Anirudh brings a decade of production-oriented applied research experience on petabyte-scale social media datasets, including Facebook, Twitter, Yahoo Answers, Quora, Foursquare, and Bing. A regular at hackathons, he has won close to three dozen awards, including top-three finishes for three years consecutively in the world’s largest private hackathon, with 16,000 participants. Some of his recent work, which IEEE has called “life changing,” has been showcased at a White House AI event, Netflix, and National Geographic and to the Prime Ministers of Canada and Singapore.

Presentations

Beyond guide dogs: How advances in deep learning can empower the blind community Session

Anirudh Koul and Saqib Shaik explore cutting-edge advances at the intersection of vision, language, and deep learning that help the blind community "see" the physical world and explain how developers can utilize this state-of-the-art image-captioning and computer-vision technology in their own applications.

Eric Kramer left medical school to join Dataiku, a big data startup in Paris. Eric specializes in the analysis of medical data and the possibilities at the intersection of medicine, data, and predictive analytics.

Presentations

Real-time epilepsy monitoring with smart clothing: A case study in time series, open source technology, and connected devices Session

Dataiku and Bioserenity have built a system for an at-home, real-time EEG and, in the process, created an open source stack for handling the data from connected devices. Eric Kramer offers an overview of the tools Dataiku and Bioserenity use to handle large amounts of time series data and explains how they created a real-time web app that processes petabytes of data generated by connected devices.

Scott Kurth is the vice president of advisory services at Silicon Valley Data Science, where he helps clients define and execute the strategies and data architectures that enable differentiated business growth. Building on 20 years of experience making emerging technologies relevant to enterprises, he has advised clients on the impact of technological change, typically working with CIOs, CTOs, and heads of business. Scott has helped clients drive global technology strategy, prioritize technology investments, shape alliance strategy based on technology, and build solutions for their businesses. Previously, Scott was director of the Data Insights R&D practice within Accenture Technology Labs, where he led a team focused on employing emerging technologies to discover the insight contained in data and bring that insight to bear on business processes, enabling new and better outcomes and even entirely new business models. He also led the creation of Technology Vision, Accenture’s annual analysis of emerging technology trends impacting the future of IT, where he was responsible for tracking emerging technologies, analyzing their transformational potential, and using that analysis to influence technology strategy for both Accenture and its clients.

Presentations

Developing a modern enterprise data strategy Tutorial

Big data and data science have great potential to accelerate business, but how do you reconcile the opportunity with the sea of possible technologies? Conventional data strategy offers little to guide us, focusing more on governance than on creating new value. Scott Kurth and John Akred explain how to create a modern data strategy that powers data-driven business.

Pierre Lacave is a senior software engineer at Corvil. Pierre has contributed to more than 200 Corvil analytics plugins to decode, analyze, and enrich network data in enterprise and electronic trading infrastructure, and he is also lead developer in the integration of the Corvil flagship product with the Hadoop/Spark ecosystem. Pierre studied computer science in Paris, France, and Chongqing, China, and holds a double MSc in software engineering.

Presentations

Using Spark and Hadoop in high-speed trading environments Session

Fergal Toomey and Pierre Lacave demonstrate how to effectively use Spark and Hadoop to reliably analyze data in high-speed trading environments across multiple machines in real time.

Mounia Lalmas is a director of research at Yahoo Labs London, where she leads a team of scientists working on advertising sciences. Mounia also holds an honorary professorship at University College London. Her work focuses on studying user engagement in areas such as native advertising, digital media, social media, and search.

Presentations

Mobile advertising: The preclick experience HDS

Mounia Lalmas offers an overview of work aimed at understanding the user preclick experience of ads and building a learning framework to identify ads with low preclick quality.

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig. Julien was previously an architect at Dremio and the tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

The future of column-oriented data processing with Arrow and Parquet Session

In pursuit of speed and efficiency, big data processing is continuing its logical evolution toward columnar execution. Julien Le Dem offers a glimpse into the future of column-oriented data processing with Arrow and Parquet.

Xavier Léauté is a software engineer at Confluent as well as a founding Druid committer and PMC member. Prior to his current role, he headed the backend engineering team at Metamarkets.

Presentations

Streaming analytics at 300 billion events per day with Kafka, Samza, and Druid Session

Xavier Léauté shares his experience scaling Metamarkets' real-time processing to over 3 million events per second and the challenges involved. Built entirely on open source, the stack performs streaming joins using Kafka and Samza and feeds into Druid, serving 1 million interactive queries per day.

Alex Leblang is an engineer at Cloudera on the RecordService team. Previously, Alex was an Apache Impala (incubating) engineer and interned at Vertica. He holds a bachelor’s degree from Brown University with concentrations in computer science and Latin American studies.

Presentations

Simplifying Hadoop with RecordService, a secure and unified data access path for compute frameworks Session

Hadoop is supremely flexible, but with that flexibility comes integration challenges. Alex Leblang introduces RecordService, a new service that eliminates the need for components to support individual file formats, handle security, perform auditing, and implement sophisticated I/O scheduling and other common processing at the bottom of any computation.

Abigail Lebrecht is principal analyst at uSwitch, where she focuses on using statistical and machine-learning techniques for descriptive analytics and modeling to understand customer behavior. Abigail has a background in probability and statistics and is passionate about encouraging an understanding of uncertainty in both big and small data. She holds a PhD in queueing theory from Imperial College London.

Presentations

Beyond the hunch: Communicating uncertainty for effective data-driven business Session

Data-driven decision making is still contentious, with decision makers skeptical that the data knows more than they do. Often they're right; if data is not communicated with a good understanding of the uncertainty, the findings can be meaningless. Abigail Lebrecht uses Bayesian and frequentist techniques to highlight bad data communication in business and the media and shows how to get it right.

Yuelin Li is the fellowship manager at ASI, partnering with companies to transform PhDs and postdocs into effective data scientists. Before joining ASI, Yuelin launched the made-to-order business at furniture ecommerce startup Swoon Editions and worked as a credit research analyst at J.P. Morgan. Yuelin holds a BA in economics from the University of Cambridge.

Presentations

Developing data scientists: Breaking the skills cap Data 101

Much has been made of the skills gap—the lack of advanced numerical, statistical, and coding expertise necessary to become a data scientist. ASI Data Science's Yuelin Li discusses a related concept: the skills cap—the career limitation that is imposed on a technically skilled junior data scientist if they cannot execute effectively in a commercial environment.

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

Hadoop's storage gap: Resolving transactional access/analytic performance trade-offs with Apache Kudu (incubating) Session

Todd Lipcon investigates the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. He also offers an overview of Kudu, the new addition to the open source Hadoop ecosystem that fills this gap, complementing HDFS and HBase to provide a new option for achieving fast scans and fast random access from a single API.

Office Hour with Todd Lipcon (Cloudera, Inc.) Office Hours

Todd will answer questions about Apache Kudu, how to best design a schema and partitioning in Kudu for your use case, and how to become a Kudu contributor.

Lucian Lita is the director of data engineering at Intuit, where he leads a big data platform and large-scale real-time data services group in the US and the EU. Previously, Lucian founded Level Up Analytics, a premier big data and data science firm focused on building data products. At BlueKai, Lucian led the Engineering & Analytics team focused on big data, real-time audience management, and analytics. Earlier, he led information extraction and medical data search efforts within Siemens Healthcare. Lucian holds a PhD in computer science from Carnegie Mellon, where he focused on applied machine learning.

Presentations

Intuit, Uber, and Etsy: Scaling innovation with A/B testing Session

A data-driven culture empowers companies to deliver greater value to their customers, yet many organizations still struggle to break down cultural barriers and drive data-driven innovation across their products. Lucian Lita, Mita Mahadevan, Shalin Mantri, and Gabrielle Gianelli explore Intuit's, Uber's, and Etsy's A/B platforms, which enable experimentation and engender a data-driven mentality.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

The next 10 years of Apache Hadoop Session

Ben Lorica hosts a conversation with Hadoop cofounder Doug Cutting and Tom White, an early user and committer of Apache Hadoop.

Roger Magoulas is the research director at O’Reilly Media and chair of the Strata + Hadoop World conferences. Roger and his team build the analysis infrastructure and provide analytic services and insights on technology-adoption trends to business decision makers at O’Reilly and beyond. He and his team find what excites key innovators and use those insights to gather and analyze faint signals from various sources to make sense of what others may adopt and why.

Presentations

Friday keynote welcome Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Thursday keynote welcome Keynote

Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Mita Mahadevan leads the development of data products at Intuit’s Data Engineering and Analytics (IDEA) group. Mita started her career building distributed analytic systems to analyze billions of retail transactions. Her experience spans several domains, from retail analytics at Demandtec (IBM) to social network analysis at Ning. Notable data products she has helped build include automated attribution for retail pricing and detection of growth and diffusion patterns in online communities. Mita mentors and advises students at Hackbright and a few of the big data fellowship programs and has presented at GHC and other industry meetups and conferences. Her hobbies include applying management principles to parenting her twin boys.

Presentations

Intuit, Uber, and Etsy: Scaling innovation with A/B testing Session

A data-driven culture empowers companies to deliver greater value to their customers, yet many organizations still struggle to break down cultural barriers and drive data-driven innovation across their products. Lucian Lita, Mita Mahadevan, Shalin Mantri, and Gabrielle Gianelli explore Intuit's, Uber's, and Etsy's A/B platforms, which enable experimentation and engender a data-driven mentality.

Seshadri Mahalingam is a software engineer at Trifacta, where, in addition to building out Wrangle, Trifacta’s domain-specific language for expressing data transformation, he develops the low-latency compute framework that powers Trifacta’s fluid and immersive data wrangling experience. Seshadri holds a BS in EECS from UC Berkeley, where he cotaught a class on open source software.

Presentations

Floating elephants: Developing data wrangling systems on Docker Session

Developers of big data applications face a unique challenge testing their software against a diverse ecosystem of data platforms that can be complex and resource intensive to deploy. Chad Metcalf and Seshadri Mahalingam explain why Docker offers a simpler model for systems by encapsulating complex dependencies and making deployment onto servers dynamic and lightweight.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Jonathan Seidman, Ted Malaska, and Gwen Shapira, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Hadoop application architectures: Fraud detection Tutorial

Jonathan Seidman, Mark Grover, Gwen Shapira, and Ted Malaska walk attendees through an end-to-end case study of building a fraud detection system, providing a concrete example of how to architect and implement real-time systems.

Introduction to Apache Spark for Java and Scala developers Session

Ted Malaska leads an introduction to basic Spark concepts such as DAGs, RDDs, transformations, actions, and executors, designed for Java and Scala developers. You'll learn how your mindset must evolve beyond Java or Scala code that runs in a single JVM as you explore JVM locality, memory utilization, network/CPU usage, optimization of DAG pipelines, and serialization conservation.

Roberto completed his degree in economics in 1989 and immediately started working as a software developer for Software AG Italia. In 1995, he began freelancing as a BI architect and analyst for Philips SpA Italy, continuing until 2004, when he joined Philips BV (Corporate) as the main BI consultant for M&S. In 2008, he moved to UPC Broadband BV to take on the role of business analyst for the Data Management team. In 2010, he became the manager of the Data Management team and created the BI strategy for the LG eDWH & OBIEE deployment. Since 2012, he has been responsible for Insight & Analytics at LG CIO Delivery.

Presentations

A journey into big data and analytics Session

Liberty Global is the largest international cable company in the world. Roberto takes you on its journey into BI and big data, from proof of concept to an Oracle Big Data Appliance solution.

Arun Karthick Manoharan is a senior product manager at eBay, where he is currently responsible for building data platforms. Prior to eBay, Arun was a product manager for IBM Data Explorer and a product manager at Vivisimo.

Presentations

Apache Eagle: Secure Hadoop in real time Session

Apache Eagle is an open source monitoring solution to instantly identify access to sensitive data, recognize malicious activities, and take action. Arun Karthick Manoharan, Edward Zhang, and Chaitali Gupta explain how Eagle helps secure a Hadoop cluster using policy-based and machine-learning user-profile-based detection and alerting.

Shalin Mantri is the product lead for Uber’s experimentation platform. Previously, Shalin started and led Uber’s rider experience team, responsible for the iOS and Android apps that millions of people use every day. Prior to Uber, he built mobile analytics and A/B testing solutions at Upsight and also founded a mobile music startup that was acquired by Jawbone. He has an MS in management science and engineering and a BS in computer science from Stanford University.

Presentations

Intuit, Uber, and Etsy: Scaling innovation with A/B testing Session

A data-driven culture empowers companies to deliver greater value to their customers, yet many organizations still struggle to break down cultural barriers and drive data-driven innovation across their products. Lucian Lita, Mita Mahadevan, Shalin Mantri, and Gabrielle Gianelli explore Intuit's, Uber's, and Etsy's A/B platforms, which enable experimentation and engender a data-driven mentality.

Eliano Marques is the Global Data Science practice lead at Think Big. Eliano has successfully led teams and projects to develop and implement analytics platforms, predictive models, and analytics operating models and has supported many businesses in making better decisions through the use of data. Recently, Eliano has been focusing on developing analytics solutions for customers around predictive asset maintenance, customer path analytics, and customer experience analytics, with a main focus on utilities, telcos, and manufacturing. Eliano holds a degree in economics, an MSc in applied econometrics and forecasting, and several certifications in machine learning and data mining.

Presentations

Realizing the value of combining the IoT and big data analytics Session

The IoT combined with big data analytics enables organizations to track new patterns and signals and to bring together data that was previously not only challenging but also prohibitively expensive to integrate. Frank Saeuberlich and Eliano Marques explain why data management, data integration, and multigenre analytics are foundational to driving business value from IoT initiatives.

Manuel Martin Marquez is a senior research and data scientist fellow at the European Organization for Nuclear Research, CERN. His current activities are focused on the development of new approaches for big data analytics applied to the CERN control system and the implementation of the CERN openlab data-analytics-as-a-service infrastructure (DAaaS). Manuel is a member of the Soft Computing and Intelligent Information Systems (SCI2S) research group and the Distributed Computational Intelligence and Time Series lab (DICITS), both at the University of Granada, Spain.

Presentations

Modern data strategy and CERN Keynote

Cloudera’s Mike Olson is joined by Manuel Martin Marquez, a senior research and data scientist at CERN, to discuss Hadoop's role in the modern data strategy and how CERN is using Hadoop and information discovery to help drive operational efficiency for the Large Hadron Collider.

Louise Matthews is the senior manager of customer and solutions marketing for Hortonworks’s international business (EMEA and APJ). With nearly 20 years’ experience in the enterprise IT sector, Louise works closely with IT decision makers to understand what better business outcomes look like for their organizations—both horizontally and vertically—in order to drive visibility and awareness of the pioneers in big data. Previously, Louise led channel and field marketing at VCE. Prior to that, Louise held a number of roles at Citrix Systems, including the first solutions marketing management role in EMEA. Earlier in her career, Louise worked marketing agency-side for organizations including CSC and Fujitsu Services. You can follow Louise on Twitter at @LouiseMatthews.

Presentations

Business transformation and outcomes through big data Session

Louise Matthews covers industry trends and transformative business use cases drawn from a wide range of market sectors across Europe to bring the future of data to life.

A successful innovator and entrepreneur, Dave McCrory came to Basho from Warner Music Group, where, as SVP of engineering, he led the team that built their digital service platform. Dave was previously senior architect in Cloud Foundry at VMware and cloud architect at Dell. He also experienced successful exits from two companies he founded: Hyper9 (acquired by SolarWinds) and Surgient (acquired by Quest Software). Dave holds nine technology patents in virtualization, cloud, and systems management as co-inventor and created the “data gravity” concept, which states that as data accumulates, it’s more likely that other services and applications will be attracted to it.

Presentations

Data gravity and complex systems Session

Dave McCrory explores the concept of data gravity—the effect that as data accumulates, there is a greater likelihood that additional services and applications will be attracted to this data, essentially having the same effect gravity has on objects around a planet—and discusses how the giant cycle of expansion and use of data and services in the cloud is created and what to do about it.

Office Hour with Dave McCrory (Basho Technologies) Office Hours

If you want to understand data gravity, stop by to talk with Dave about the causes and effects of data gravity and data gravity dynamics in a market space.

Patrick McFadin is one of the leading experts in Apache Cassandra and data-modeling techniques. As a consultant and the chief evangelist for Apache Cassandra at DataStax, Patrick has helped build some of the largest and most exciting deployments in production. Prior to DataStax, he was chief architect at Hobsons, an education services company. There, Patrick spoke often on web application design and performance.

Presentations

An introduction to time series with Team Apache Tutorial

We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, even as users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day using the powerful Team Apache: Apache Kafka, Spark, and Cassandra.

Jason McFall is the CTO at Privitar, a London startup using machine learning and statistical techniques to open up data for safe secondary use, without violating individual privacy. Jason has a background in applying machine learning to marketing automation and customer analytics. Before that, he was an experimental physicist, working on particle physics collider experiments.

Presentations

Protecting individual privacy in a data-driven world Session

With the analytic and predictive power of big data comes the responsibility to respect and protect individual privacy. As citizens, we should hold organizations to account; as data practitioners, we must find intelligent ways to analyze data without violating privacy. Jason McFall discusses privacy risks and surveys leading privacy-preserving analysis techniques.

Alyona Medelyan has been working on algorithms that make sense of language data for over a decade. Her passion lies in helping businesses extract useful knowledge from text. As part of her PhD, she proved that her open source algorithm, Maui, can be as accurate as people at finding keywords. She has worked with large multinationals like Cisco and Google, has led R&D teams, and has consulted for small and large companies around the globe. Alyona now runs Thematic, a customer insight company.

Presentations

Applications of natural language understanding: Tools and technologies Session

With the rise of deep learning, natural language understanding techniques are becoming more effective and are not as reliant on costly annotated data. This leads to an explosion of possibilities of what businesses can do with language. Alyona Medelyan explains what the newest NLU tools can achieve today and presents their common use cases.

Chad Metcalf is a solutions engineering manager for Docker. Previously, Chad worked at Puppet Labs and was an infrastructure engineer at WibiData and Cloudera.

Presentations

Floating elephants: Developing data wrangling systems on Docker Session

Developers of big data applications face a unique challenge testing their software against a diverse ecosystem of data platforms that can be complex and resource intensive to deploy. Chad Metcalf and Seshadri Mahalingam explain why Docker offers a simpler model for systems by encapsulating complex dependencies and making deployment onto servers dynamic and lightweight.

Piotr Mirowski is a research scientist at Google DeepMind. Previously, Piotr worked at Bell Labs and Microsoft Bing. Piotr has been trying to make sense of all sorts of sequential data, ranging from sensor data (e.g., EEG for epilepsy prediction, smart meter logs for power demand prediction, or WiFi and inertial data for geolocalization) to text (speech recognition or search query completion). Piotr studied computer science at ENSEEIHT in France, followed by a PhD in deep learning in Yann LeCun’s lab at New York University.

Presentations

Deep learning for web-scale text HDS

Piotr Mirowski looks under the hood of recurrent neural networks and explains how they can be applied to speech recognition, machine translation, sentence completion, and image captioning.

Sherry Moore is a software engineer on the Google Brain team. Her other projects at Google include Google Fiber and Google Ads Extractor. Previously, she spent 14 years as a systems and kernel engineer at Sun Microsystems.

Presentations

TensorFlow: Machine learning for everyone Session

TensorFlow is an open source software library for numerical computation with a focus on machine learning. Its flexible architecture makes it great for research and production deployment. Sherry Moore offers a high-level introduction to TensorFlow and explains how to use it to train machine-learning models to make your next application smarter.

Todd Mostak is the founder of MapD. He is a graduate of Harvard’s Kennedy School of Government.

Presentations

The rise of the GPU: GPUs will change how you look at big data Session

GPU-based databases, visualization layers, and analytic platforms have an immense advantage over their CPU-bound counterparts. Todd Mostak explains how data scientists and analysts can execute and visualize SQL queries on billions of rows of data in milliseconds—up to 1,000x faster than legacy CPU systems—by leveraging the parallel processing power and memory bandwidth of GPUs.

Surya Mukherjee is a senior analyst for Ovum’s Information Management team responsible for the analysis of enterprises’ business intelligence technology investment priorities, market forecast models, and product and vendor evaluations. He is also responsible for the delivery of research-based consulting projects relating to the information management software markets. Based in London, Surya is a thought leader and has given keynotes at several global events.

Presentations

Big SQL: The future of in-cluster analytics and enterprise adoption Session

Hear why big SQL is the future of analytics. Experts at Yahoo, Knewton, FullStack Analytics, and Looker discuss their respective data architectures, the trials and tribulations of running analytics in-cluster, and examples of the real business value gained from putting their data in the hands of employees across their companies.

Ignacio Mulas is a researcher working in the area of cloud analytics at Ericsson Research. He is an experienced software engineer and data scientist in the cloud domain. Lately, Ignacio has been interested in developing streaming analytics pipelines following the Kappa architecture and in their applicability to industrial use cases.

Presentations

Kappa architecture in the telecom industry Session

ICT systems are growing in size and complexity. Monitoring and orchestration mechanisms need to evolve and provide richer capabilities to help handle them. Ignacio Manuel Mulas Viela and Nicolas Seyvet analyze a stream of telemetry/logs in real time by following the Kappa architecture paradigm, using machine-learning algorithms to spot unexpected behaviors from an in-production cloud system.

Calum Murray is the chief data architect in the Small Business group at Intuit. Calum has 20 years’ experience in software development, primarily in the finance and small business spaces. Over his career, he has worked with various languages, technologies, and topologies to deliver everything from real-time payments platforms to business intelligence platforms.

Presentations

Analytics: A first-class architectural concern in a SaaS platform Session

As Intuit evolved QuickBooks, Payroll, Payments, and other product offerings into a SaaS business and an open cloud platform, it quickly became apparent that business analytics could no longer be treated as an afterthought but had to be part of the platform architecture as a first-class concern. Calum Murray outlines key design considerations when architecting analytics into your SaaS platform.

Raghunath Nambiar is the CTO for Cisco UCS, where he helps define strategies for next-generation architectures, systems, and data center solutions and leads a team of engineers and product leaders focused on emerging technologies and solutions. Raghu’s current focus areas include emerging technologies, data center solutions, and big data and analytics strategy. He is Cisco’s representative for standards bodies for system performance, has served on several industry standard committees for performance evaluation and on program committees of leading academic conferences, and chaired the industry’s first standards committee for benchmarking big data systems. Raghu has years of technical accomplishments and significant expertise in system architecture, performance engineering, and creating disruptive technology solutions. He is a member of the IEEE big data steering committee, serves on the board of directors of the Transaction Processing Performance Council (TPC), and is founding chair of its International Conference Series on Performance Evaluation and Benchmarking. He has published 50+ peer-reviewed papers and book chapters. Raghu holds master’s degrees from the University of Massachusetts and Goa University and completed an advanced management program at Stanford University.

Presentations

Avoid big data becoming a big problem Session

Raghunath Nambiar reviews the big data landscape, reflects on big data lessons learned in enterprise over the last few years, and explores how these organizations avoid their big data environments becoming unmanageable by using simplex management for deployment, administration, monitoring, and reporting no matter how much the environment scales.

Neha Narkhede is the cofounder and head of engineering at Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s petabyte-scale streaming infrastructure built on top of Apache Kafka and Apache Samza. Neha specializes in building and scaling large distributed systems and is one of the initial authors of Apache Kafka. A distributed systems engineer by training, Neha works with data scientists, analysts, and business professionals to move the needle on results.

Presentations

Ask me anything: Apache Kafka Ask Me Anything

Ian Wrigley, Neha Narkhede, and Flavio Junqueira field a wide range of detailed questions about Apache Kafka. Even if you don’t have a specific question, join in to hear what others are asking.

Introducing Kafka Streams, Apache Kafka's new stream processing library Session

Neha Narkhede offers an overview of Kafka Streams, a new stream processing library natively integrated with Apache Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such, it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka.

Oliver Newbury is CTO of BT Security, the organization responsible for securing both BT and its customers globally. Oliver is responsible for defining and driving delivery of the technical strategy and capability roadmap for security within BT to ensure BT’s own approach and portfolio of security products and services stays ahead of the constantly evolving threat landscape. Oliver leads an expert global team of cybersecurity specialists, architects, and engineers, whose role includes driving pull-through of innovative security capability, evaluating new security technology and partners, cybersecurity service design, architectural authority for BT’s security capability and portfolio, and the delivery of advanced cyberdefense solutions and services to BT’s clients globally.

Oliver’s previous roles include establishing and leading BT Security’s cyber-consultancy practice aimed at helping major global clients better understand their security posture with a risk-based approach to cybersecurity; the establishment of a specialist Cyber Defence Services organization within BT providing a full spectrum of cyberdefense services to the government and defense sectors and working closely with the UK government to deliver the UK National Cyber Security Strategy. Earlier in his career, Oliver gained a broad base of technical expertise across software architecture, high-speed packet processing, complex systems engineering, service provider networking, security architecture, network forensics, malware analysis, big data, and data science.

Presentations

BT Assure Cyber: Enabling new revenue with Hadoop Session

The global cyberthreat landscape is a constantly evolving environment. Oliver Newbury outlines BT’s cybersecurity strategy and offers an overview of BT Assure Cyber—the big data solution it has built to protect its own data and help others protect themselves—which sorts through masses of data, stitches together subtle clues, and produces useful, actionable information for cybersecurity.

Piotr Niedzwiedz is a founder and CTO of deepsense.io, a big data science company based in Menlo Park, California, and Warsaw, Poland. Deepsense.io provides machine-learning and deep learning consulting and has developed Seahorse, a scalable data analytics workbench powered by Apache Spark, which lets users build data-processing workflows without needing to write any code. Piotr is a successful entrepreneur. Prior to deepsense.io, he cofounded CodiLime, an IT company delivering software services in networks and security areas. Previously, he worked as a software engineer at Google and Facebook on projects related to big data and distributed systems. He supports and invests in startups using machine-learning solutions, such as Dealavo.com. Piotr holds a Double Degree Program diploma in mathematics and computer science from the University of Warsaw, is a Polish Collegiate Programming Contest winner, and finished fourth in the 2009 ACM Central European Programming Contest.

Presentations

Saving whales with deep learning Keynote

Piotr Niedźwiedź explores how deepsense.io created the world’s best deep learning model for identifying individual right whales using aerial photography for the NOAA (National Oceanic and Atmospheric Administration) and explains what happened when the solution was covered by news media around the globe.

Kim Nilsson is the CEO of Pivigo, a London-based data science marketplace and training provider responsible for S2DS, Europe’s largest data science training program, which has by now trained more than 340 fellows working on over 85 commercial projects with 60+ partner companies, including Barclays, KPMG, Royal Mail, News UK, and Marks & Spencer. An ex-astronomer turned entrepreneur with a PhD in astrophysics and an MBA, Kim is passionate about people, data, and connecting the two.

Presentations

Data scientists everywhere DDBD

Just as the cloud revolutionized how companies distribute and manage their data, the move from physical offices to geodistributed teams is revolutionizing hiring and work practices. Kim Nilsson explains how Pivigo’s S2DS data science program broadened its reach by running online for geodistributed data scientists across Europe and shares practical details of what did and didn’t work.

Office Hour with Kim Nilsson (Pivigo) Office Hours

If you are creating your organization's data strategy, time spent with Kim could be invaluable. Stop by to discuss how to set up a data science/analytics team, what skills are needed, how to attract and retain talent, how to organize analytics teams (including geodistributed teams), or how to become a data scientist.

Steven Noels is the SVP of product at NGDATA, where he is responsible for NGDATA’s overall product strategy and roadmap. Steven cofounded Outerthought—now known as NGDATA—and is the original designer of the Lily platform, which sits at the core of the NGDATA software portfolio. Outerthought was named a Cool Vendor in Enterprise Content Management by Gartner in 2010. Steven has 15 years of product management experience and is extensively networked with the open source community in and around the Apache Hadoop big data ecosystem. Prior to NGDATA, Steven held senior roles in technology consulting and product management with Alcatel and Wolters-Kluwer, specializing in complex and large-scale data management problems and content publishing. He’s a member of the Apache Software Foundation and holds a board position in the GentBC innovation platform.

Presentations

The future is now: Leveraging Hadoop for real-time, predictive insights Session

Steven Noels explains how to prime the Hadoop ecosystem for real-time data analysis and actionability, examining ways to evolve from batch processing to real-time stream-based processing.

Erik Nygard is a cofounder at Limejump Ltd., where he is responsible for strategy and business development. Erik has substantial experience in electricity trading, hedging, and optimization.

Presentations

Harnessing big data to transform the energy sector DDBD

In order to move away from carbon-intensive fossil fuels to a greener generation mix, a transformational shift toward a smarter energy system is needed. This shift requires an unprecedented amount of data to be processed to unlock hidden capacity and demand flexibility on the network. Erik Nygard explains why this is only possible with disruptive energy tech and advanced analytics.

Kate O’Neill, founder and CEO of KO Insights, is a tech humanist and cultural strategist focused on meaningfulness in data, technology, business, and life overall. In 2009, Kate launched and grew [meta]marketer, a digital strategy and analytics firm, over a five-year period and significantly shaped the marketing analytics landscape. Prior to [meta]marketer, Kate’s experience included creating the first content management role at Netflix, leading cutting-edge online optimization work at Magazines.com, developing Toshiba America’s first intranet, building the first website at the University of Illinois at Chicago, and holding leadership positions in a variety of digital content and technology startups. Kate has been featured in CNN Money, Time, Forbes, USA Today, and other national media. She is the author of an upcoming book on meaningfulness in marketing. Kate is a vocal and visible advocate for women in technology, entrepreneurship, and leadership—she was featured in Google’s global campaign for women in entrepreneurship.

Presentations

Pixels and place: What online experiences can borrow from offline spaces and vice versa Session

The metaphors used online have always borrowed heavily from the offline world, but as our online and offline worlds converge, the biggest opportunities for innovative experiences will come from blending them intentionally. Kate O’Neill examines how the meaning and understanding of place relates to identity, culture, and intent and how we can shape our audiences' experiences more meaningfully.

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and applications solutions. A veteran of @WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Office Hour with Stephen O'Sullivan (Silicon Valley Data Science) Office Hours

Stephen is happy to answer questions about how to create a data platform supporting batch, interactive, and real-time analytical workloads, as well as tool selection and how to integrate Hadoop components with legacy systems.

Francesca Odone is an associate professor of computer science at the University of Genova, Italy. Francesca’s research interests are in the fields of computer vision and machine learning. In particular, most of her research activity in recent years has been devoted to finding good visual representations able to capture the complexity of a problem, while allowing for the design of systems with the ability to perform their visual tasks in real time. In this respect, she has been involved in learning representations for high-dimensional data, (structured) feature selection, dimensionality reduction, support set estimation, visual recognition pipelines for object detection, retrieval, and recognition in images and image sequences, algorithms for behavior understanding, and action recognition. Francesca received a laurea degree in information sciences and a PhD in computer science, both from the University of Genova. She was a visiting student at Heriot-Watt University, Edinburgh, UK, with a EU Marie Curie research grant, as well as a researcher at the Italian National Institute for Solid State Physics. Besides theory and algorithms, Francesca also enjoys playing with real-world applications. Over the years, she has been a scientific coordinator of technology transfer and applied research projects.

Presentations

Visual data analysis for intelligent machines HDS

Francesca Odone explores analyzing visual data (images and videos) with the purpose of extracting meaningful information to solve different scene-understanding tasks. Francesca addresses the problem of learning adaptive data representations and covers different application scenarios, including human-robot interaction, activity recognition, and object categorization.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Modern data strategy and CERN Keynote

Cloudera’s Mike Olson is joined by Manuel Martin Marquez, a senior research and data scientist at CERN, to discuss Hadoop's role in the modern data strategy and how CERN is using Hadoop and information discovery to help drive operational efficiency for the Large Hadron Collider.

Gilad Olswang leads the healthcare initiatives in the big data organization at Intel. Within this role, he manages large programs that span from wearable data for cardiologic research to genomics and imaging analysis for cancer research. Prior to this role, Gilad led multiple teams at Intel focusing on big data analytics and machine learning targeted at optimizing and accelerating internal R&D.

Presentations

Analytics innovation in cancer research Keynote

Federated analytics, a new approach to analyzing big data, supports unprecedented collaboration across large distributed datasets that contain proprietary and/or protected information. Gilad Olswang explains how Intel harnesses the power of federated analytics in the Collaborative Cancer Cloud project.

Federated analytics innovation in cancer research Session

Federated analytics, a new approach to analyzing big data, balances privacy, autonomy, and IP protection and supports unprecedented collaboration across large distributed datasets that contain proprietary and/or protected information. Gilad Olswang explains how Intel harnesses the power of federated analytics in the Collaborative Cancer Cloud project.

Todd Palino is a site reliability engineer at LinkedIn tasked with keeping Zookeeper, Kafka, and Samza deployments fed and watered. His days are spent, in part, developing monitoring systems and tools to make that job a breeze. Previously, Todd was a systems engineer at Verisign, where he developed service-management automation for DNS, networking, and hardware management and managed hardware and software standards across the company.

Presentations

Office Hour with Gwen Shapira (Confluent) and Todd Palino (LinkedIn) Office Hours

Join Gwen Shapira, Todd Palino, and other Apache Kafka experts for a fast-paced conversation on Apache Kafka use cases, troubleshooting Apache Kafka, using Kafka in stream architectures, and when to avoid Kafka.

Putting Kafka into overdrive Session

Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Gwen Shapira and Todd Palino explain the right approach for getting the most out of Kafka, exploring how to monitor, optimize, and troubleshoot performance of your data pipelines from producer to consumer and from development to production.

I am a Hadoop big data consultant working on data science, advanced analytics, predictive modeling, machine learning, and the IoT.

I have designed and implemented various Hadoop big data products, details of which can be found at http://ngvtech.in.

Presentations

Best practices and solutions to manage and govern a multinational big data platform Session

As companies seek to expand their global data footprint, many new challenges arise. How can data be shared across countries? How can a company go about managing all of the policies and regulations specific to each country that customers reside in? Clara Fletcher explores best practices and lessons learned in international data management and security.

Andy Petrella is a mathematician turned distributed computing entrepreneur. Besides being a Scala/Spark trainer, Andy has participated in many projects built using Spark, Cassandra, and other distributed technologies in various fields, including geospatial analysis, the IoT, and automotive and smart cities projects. Andy is the creator of the Spark Notebook, the only reactive and fully Scala notebook for Apache Spark. In 2015, Andy cofounded Data Fellas with Xavier Tordoir around their product the Agile Data Science Toolkit, which facilitates the productization of data science projects and guarantees their maintainability and sustainability over time. Andy is also a member of the program committee for the O’Reilly Strata, Scala eXchange, Data Science eXchange, and Devoxx events.

Presentations

Deep learning and natural language processing with Spark Session

Deep learning is taking data science by storm, due to the combination of stable distributed computing technologies, increasing amounts of data, and available computing resources. Andy Petrella and Melanie Warrick show how to implement a Spark­-ready version of the long short­-term memory (LSTM) neural network, widely used in the hardest natural language processing and understanding problems.

Scala: The unpredicted lingua franca for data science Session

Andy Petrella and Dean Wampler explore what it means to do data science today and why Scala succeeds at coping with large and fast data where older languages fail. Andy and Dean then discuss ongoing projects in advanced data science that use Scala as the main language, including Splash, mic-cut problem, OptiML, needle (DL), ADAM, and more.

Stefanie Posavec is a designer for whom data is her favored material, with projects ranging from data visualization and information design to commissioned data art. Her personal work focuses on nontraditional representations of data derived from language, literature, or scientific topics, often using a hand-crafted approach. Her work has been exhibited internationally at major galleries including MoMA and The Storefront for Art & Architecture (New York), CCBB (Rio de Janeiro), the Science Gallery (Dublin), the V&A, the Science Museum, the Southbank Centre, and Somerset House (London), and Milan Design Week. Stefanie recently completed the year-long Dear Data drawing project with Giorgia Lupi, deemed the Best Dataviz Project and the Most Beautiful (the highest accolade) at the 2015 Kantar Information Is Beautiful Awards. A book of this project will be published in September 2016 by Particular Books (Penguin Random House UK) and Princeton Architectural Press (USA).

Presentations

Drawing insights from imperfection: A year of Dear Data Keynote

Stefanie Posavec recently completed a year-long drawing project with Giorgia Lupi called Dear Data, where each week they manually gathered and drew their data on a postcard to send to the other. Stefanie discusses the professional insights she gained spending a year on such an intensive personal data project.

Zoltan Prekopcsak is the vice president of big data at RapidMiner, the leader in modern analytics. He has experience in data-driven projects in various industries, including telecommunications, financial services, ecommerce, neuroscience, and many more. Previously, Zoltan was cofounder and CEO of Radoop before its acquisition by RapidMiner; a data scientist at Secret Sauce Partners, Inc., where he created a patented technology for predicting customer behavior; and a lecturer at Budapest University of Technology and Economics, his alma mater, with a focus on big data and predictive analytics. Zoltan has dozens of publications and is a regular speaker at international conferences.

Presentations

Best practices to extract value from Hadoop with predictive analytics Session

Turning big data into tangible business value can be a struggle even for highly skilled data scientists. Zoltan Prekopcsak outlines the best practices that make life easier, simplify the process, and implement results faster.

As the executive director at the Human Rights Data Analysis Group, Megan Price designs strategies and methods for statistical analysis of human rights data for projects in a variety of locations including Guatemala, Colombia, and Syria. Megan’s work in Guatemala includes serving as the lead statistician, since 2009, on a project in which she analyzes documents from the National Police Archive; she has also contributed analyses submitted as evidence in two court cases in Guatemala. Her work in Syria includes serving as the lead statistician and author on two recent reports, commissioned by the Office of the United Nations High Commissioner of Human Rights (OHCHR), on documented deaths in that country. Megan is a research fellow at the Carnegie Mellon University Center for Human Rights Science. She earned her PhD in biostatistics and a certificate in human rights from the Rollins School of Public Health at Emory University. Megan also holds an MS and BS in statistics from Case Western Reserve University.

Presentations

Machine learning for human rights advocacy: Big benefits, serious consequences Keynote

Megan Price demonstrates how machine-learning methods help us determine what we know, and what we don't, about the ongoing conflict in Syria. Megan then explains why these methods can be crucial to better understand patterns of violence, enabling better policy decisions, resource allocation, and ultimately, accountability and justice.

Richard Pugh is cofounder and chief data scientist at Mango.

Presentations

R and reproducible reporting for big data Tutorial

Aimee Gott, Mark Sellors, and Richard Pugh explore techniques for optimizing your workflow in R when working with big data, including how to efficiently extract data from a database, techniques for visualization and analysis, and how all of this can be incorporated into a single, reproducible report, directly from R.

Daniele Quercia is currently building the Social Dynamics group at Bell Labs in Cambridge, UK. Daniele’s research focuses on the area of urban informatics and has received best paper awards from Ubicomp 2014 and ICWSM 2015 as well as an honorable mention from ICWSM 2013. Previously, he was a research scientist at Yahoo Labs, a Horizon senior researcher at the University of Cambridge, and a postdoctoral associate at MIT. Daniele has been named one of Fortune magazine’s 2014 data all-stars and has spoken about “happy maps” at TED. He holds a PhD from University College London. His thesis was sponsored by Microsoft Research and was nominated for BCS best British PhD dissertation in computer science.

Presentations

Good city life Session

Daniele Quercia discusses the launch of Goodcitylife.org—a global group of like-minded people who are passionate about building technologies whose focus is not necessarily to create a smart city but to give a good life to city dwellers.

Karthik Ramasamy is the engineering manager and technical lead for real-time analytics at Twitter. Karthik is the cocreator of Heron and has more than two decades of experience working in parallel databases, big data infrastructure, and networking. He cofounded Locomatix, a company that specializes in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Before Locomatix, he had a brief stint with Greenplum, where he worked on parallel query scheduling. Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases, and high-availability solutions for network routers that are widely deployed in the Internet. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik has a PhD in computer science from UW Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Processing billions of events in real time with Heron Session

Heron has been in production at Twitter for nearly two years and is widely used by several teams for diverse use cases. Karthik Ramasamy describes Heron in detail, covering a few use cases in-depth and sharing the operating experiences and challenges of running Heron at scale.

Tom Reilly is the CEO of Cloudera. Tom has had a distinguished 30-year career in the enterprise software market. Previously, Tom was vice president and general manager of enterprise security at HP; CEO of enterprise security company ArcSight, where he led the company through a successful initial public offering and subsequent sale to HP; and vice president of business information services for IBM, following the acquisition of Trigo Technologies Inc., a master data management (MDM) software company, where he served as CEO. He currently serves on the boards of Jive Software, privately held Ombud Inc., ThreatStream Inc., and Cloudera. Tom holds a BS in mechanical engineering from the University of California, Berkeley.

Presentations

Apache Hadoop meets cybersecurity Keynote

The cybersecurity landscape is quickly changing, and Apache Hadoop is becoming the analytics and data management platform of choice for cybersecurity practitioners. Tom Reilly explains why organizations are turning toward the open source ecosystem to break down traditional cybersecurity analytics and data constraints in order to detect a new breed of sophisticated attacks.

As director of product management at MapR Technologies, Neeraja Rentachintala is responsible for the product strategy, roadmap, and requirements of MapR SQL initiatives. Prior to MapR, Neeraja held numerous product management and engineering roles at Informatica, Microsoft SQL Server, Oracle, and Expedia.com, most recently as the principal product manager for Informatica Data Services/Data Virtualization. Neeraja holds a BS in electronics and communications from the National Institute of Technology in India and is product management certified from the University of Washington.

Presentations

Adding complex data to the Spark stack Session

Neeraja Rentachintala discusses the latest integrations between Apache Drill and Spark technologies. Together, the combination allows Spark users to leverage Drill’s flexible schema and dynamic schema discovery capabilities to query and work with complex data directly using familiar Spark programming paradigms.

Stephane Rion is a senior data scientist at Big Data Partnership, where he helps clients get insight into their data by developing scalable analytical solutions in industries such as finance, gaming, and social services. Stephane has a strong background in machine learning and statistics with over 6 years’ experience in data science and 10 years’ experience in mathematical modeling. He has solid hands-on skills in machine learning at scale with distributed systems like Apache Spark, which he has used to develop production rate applications. In addition to Scala with Spark, Stephane is fluent in R and Python, which he uses daily to explore data, run statistical analysis, and build statistical models. He was the first Databricks-certified Spark instructor in EMEA. Stephane enjoys splitting his time between working on data science projects and teaching Spark classes, which he feels is the best way to remain at the forefront of the technology and capture how people are attempting to use Spark within their businesses.

Presentations

Spark foundations: Prototyping Spark use cases on Wikipedia datasets Training

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Stephane Rion leads hands-on exercises exploring various Wikipedia datasets to illustrate the variety of programming paradigms Spark makes possible.

Spark foundations: Prototyping Spark use cases on Wikipedia datasets (Day 2) Training Day 2

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Stephane Rion leads hands-on exercises exploring various Wikipedia datasets to illustrate the variety of programming paradigms Spark makes possible.

Anne Sophie Roessler is a deployment strategist at Dataiku, developer of Data Science Studio (DSS), which integrates all the capabilities required to quickly build end-to-end, highly specific services that turn raw data into business-impacting predictions. From her experience in project management, Anne Sophie became convinced that collaboration is a particularly relevant topic in the big data environment. She believes that in the future, all the stakeholders of data-driven projects will have to be able to work with data, whether they have technical skills or not. Anne Sophie graduated from ESCP Paris. She also studied classical singing and worked as an opera singer for a few years.

Presentations

What Esperanto can teach us about collaboration in the big data environment DDBD

Data-driven projects depend on a complex environment. You have many stakeholders with different skill sets involved, but all these skills are equally crucial to the project. Anne Sophie Roessler uses the example of the failed universal language Esperanto to explain how to help these stakeholders—most of whom use different languages and technologies and have different baselines—work together.

Duncan Ross is data and analytics director at TES Global. Duncan has been a data miner since the mid-1990s. Previously at Teradata, Duncan created analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing and social network analysis in telecommunications. In his spare time, Duncan has been a city councilor, chair of a national charity, founder of an award-winning farmers’ market, and one of the founding directors of the Institute of Data Miners. More recently, he cofounded DataKind UK and regularly speaks on data science and social good.

Presentations

The best university in the world Session

In 2014, Times Higher Education made the decision to move from being a traditional publisher to being a data business. As part of the move, it needed to bring the creation of the World University Rankings in-house and build a set of data products from scratch. Duncan Ross and Francine Bennett explain how the transition was made and highlight the challenges and lessons learned.

Using data for evil IV: The journey home Session

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis. Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

Stuart Russell is a professor (and former chair) in the Electrical Engineering and Computer Sciences department at the University of California, Berkeley, where he holds the Smith-Zadeh Chair in Engineering. He is also an adjunct professor of neurological surgery at UC San Francisco and vice chair of the World Economic Forum’s Council on AI and Robotics. Stuart’s research covers a wide range of topics in artificial intelligence, including machine learning, probabilistic reasoning, knowledge representation, planning, real-time decision making, multitarget tracking, computer vision, computational physiology, global seismic monitoring, and philosophical foundations. His books include The Use of Knowledge in Analogy and Induction, Do the Right Thing: Studies in Limited Rationality (with Eric Wefald), and Artificial Intelligence: A Modern Approach (with Peter Norvig). His current concerns include the threat of autonomous weapons and the long-term future of artificial intelligence and its relation to humanity.

Stuart is a recipient of the Presidential Young Investigator Award of the National Science Foundation, the IJCAI Computers and Thought Award, the World Technology Award (policy category), the Mitchell Prize of the American Statistical Association and the International Society for Bayesian Analysis, the ACM Karlstrom Outstanding Educator Award, and the EAAI Outstanding Educator Award. In 1998, he gave the Forsythe Memorial Lectures at Stanford University and from 2012 to 2014, he held the Chaire Blaise Pascal in Paris. He is a fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the American Association for the Advancement of Science. Stuart holds a PhD in computer science from Stanford and a BA with first-class honors in physics from Oxford University.

Presentations

Office Hour with Stuart Russell (UC Berkeley) Office Hours

Want to debate (or just explore) the future of artificial intelligence? Stop by and talk with Stuart. It’s sure to be fascinating.

Panel: The future of intelligence Session

Stuart Russell and Jaan Tallinn explore and debate the future of artificial intelligence in a panel discussion moderated by Marc Warner.

The future of (artificial) intelligence Keynote

The news media in recent months has been full of dire warnings about the risk that AI poses to the human race. Should we be concerned? If so, what can we do about it? While some in the mainstream AI community dismiss these concerns, Stuart Russell argues that a fundamental reorientation of the field is required.

Frank Säuberlich is the director of data science and leads Teradata’s International Data Science team—a role combining demand generation across EMEA and APJ with analytical innovation. Frank previously worked at Urban Science International, where he was a regional manager responsible for customer analytics. In this role, he worked with client teams around the globe to implement analytical solutions and pioneered new types of analysis to improve the efficiency of automotive clients’ marketing efforts. Prior to that, as European customer solutions practice manager, he was responsible for the Urban Science Customer Solutions practice in Europe. Frank holds a PhD in economics and a master’s degree in economic mathematics from the University of Karlsruhe.

Presentations

Realizing the value of combining the IoT and big data analytics Session

The IoT combined with big data analytics enables organizations to track new patterns and signals and to bring together data that was previously not only a challenge to integrate but also prohibitively expensive. Frank Saeuberlich and Eliano Marques explain why data management, data integration, and multigenre analytics are foundational to driving business value from IoT initiatives.

Bikas Saha has been working in the Apache Hadoop ecosystem since 2011, focusing on YARN and the Hadoop compute stack, and is a committer/PMC member of the Apache Hadoop and Tez projects. Bikas is currently working on Apache Tez, a new framework to build high-performance data processing applications natively on YARN. He has been a key contributor in making Hadoop run natively on Windows. Prior to Hadoop, he worked extensively on the Dryad distributed data processing framework that runs on some of the world’s largest clusters as part of Microsoft’s Bing infrastructure.

Presentations

Why is my Hadoop job slow? Session

Hadoop is used to run large-scale jobs over hundreds of machines. Considering the complexity of Hadoop jobs, it's no wonder that Hadoop jobs running slower than expected remains a perennial source of grief for developers. Bikas Saha draws on his experience debugging and analyzing Hadoop jobs to describe the approaches and tools that can solve this difficult problem.

Kostas Sakellis is a software developer at Cloudera working on the core enterprise team. Kostas holds a bachelor’s degree in computer science from the University of Waterloo, Canada.

Presentations

Securing Apache Spark on production Hadoop clusters Session

As Spark is used more and more frequently for production workloads with stringent security requirements, fully locking down Spark applications has become critical. Kostas Sakellis explores the various facets of securing your Spark application.

Neelesh Srinivas Salian is a software engineer on the Data Platform Infrastructure team in the Algorithms group at Stitch Fix, where he helps build the ecosystem around Apache Spark. Previously, he worked at Cloudera on Apache projects such as YARN, Spark, and Kafka. He holds a master’s degree in computer science with a focus on cloud computing from North Carolina State University and a bachelor’s degree in computer engineering from the University of Mumbai, India.

Presentations

Breaking Spark: Top five mistakes to avoid when using Apache Spark in production Session

Spark has been growing in deployments for the past year. The increasing amount of data being analyzed and processed through the framework is massive and continues to push the boundaries of the engine. Drawing on his experiences across 150+ production deployments, Neelesh Srinivas Salian explores common issues observed in a cluster environment setup with Apache Spark.

Mark Samson is a principal systems engineer at Cloudera, helping customers solve their big data problems using enterprise data hubs based on Hadoop. Mark has 17 years’ experience working with big data and information management software in technical sales, service delivery, and support roles.

Presentations

Apache Hadoop operations for production systems Tutorial

Jayesh Seshadri, Justin Hancock, Mark Samson, and Wellington Chevreuil offer a full-day deep dive into all phases of successfully managing Hadoop clusters—from installation to configuration management, service monitoring, troubleshooting, and support integration—with an emphasis on production systems.

Ask me anything: Hadoop operations Ask Me Anything

Mark Samson, Jayesh Seshadri, Wellington Chevreuil, and James Kinley, the instructors of the full-day tutorial Apache Hadoop Operations for Production Systems, field a wide range of detailed questions about Hadoop, from debugging and tuning across different layers to tools and subsystems to keep your Hadoop clusters always up, running, and secure.

Majken Sander is a data nerd, business analyst, and solution architect at TimeXtender. Majken has worked with IT, management information, analytics, BI, and DW for 20+ years. Armed with strong analytical expertise, she is keen on “data driven” as a business principle, data science, the IoT, and all other things data.

Presentations

My AlgorithmicMe knows me better than Google or my mum DDBD

Emphasizing the importance of higher awareness, education, and insight about the subjective algorithms that affect our lives, Majken Sander explores the value judgements built into algorithms, discusses their consequences, and presents possible solutions, including visionary concepts like an AlgorithmicMe that could raise awareness and guide developers, analysts, and data scientists.

My AlgorithmicMe: The "Who is. . .?" of the future Session

Who does your computer think I am? Today, every person is digitally represented in a multitude of IT systems, based on invisible algorithms that pervasively control pieces of our lives through decisions made based on our preferences, interests, and even future actions. Joerg Blumtritt and Majken Sander explore these judgments, discuss their consequences, and present possible solutions.

Krishna Sankar is a consulting data scientist working on retail analytics, social media data science, and forays into deep learning, as well as codeveloping the DeepLearnR package interfacing R over TensorFlow/Skflow. Previously, Krishna was a chief data scientist at Blackarrow.tv, where he focused on optimizing user experience via inference, intelligence, and interfaces. Earlier stints include principal architect/data scientist at Tata America Intl., director of data science at a bioinformatics startup, and distinguished engineer at Cisco. He is a frequent speaker at conferences, including Spark Summit, Spark Camp, OSCON, PyCon, and PyData, on topics such as predicting NFL winners, Spark, data science, machine learning, and social media analysis, as well as a guest lecturer at the Naval Postgraduate School. Krishna’s occasional blogs can be found at Doubleclix.wordpress.com. His other passion is Lego robotics. You will find him at the St. Louis First Lego League World Competition as a robot design judge.

Presentations

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX Tutorial

Jayant Shekhar, Vartika Singh, and Krishna Sankar explore techniques for building machine-learning apps using Spark ML as well as the principles of graph processing with Spark GraphX.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for six years.

Presentations

Legacy or Kafka? What an ideal messaging system should bring to Hadoop Session

Application messaging isn’t new. Solutions like message queues have been around for a long time, but newer solutions like Kafka have emerged as high-performance, high-scalability alternatives that integrate well with Hadoop. Should distributed messaging systems like Kafka be considered replacements for legacy technologies? Jim Scott answers that question by delving into architectural trade-offs.

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Jonathan Seidman, Ted Malaska, and Gwen Shapira, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Hadoop application architectures: Fraud detection Tutorial

Jonathan Seidman, Mark Grover, Gwen Shapira, and Ted Malaska walk attendees through an end-to-end case study of building a fraud detection system, providing a concrete example of how to architect and implement real-time systems.

David Selby is a senior data scientist with IBM’s Insight Cloud Services team based at IBM’s Hursley Park in the UK, where he specializes in leveraging advanced machine-learning techniques to provide business advantage. David is a “master inventor” with more than 50 patents in the area of big data and analytics. He has more than 28 years in the advanced analytics space and has worked in over 50 countries in a wide variety of industries. David is a fellow of the British Computer Society.

Presentations

The curious case of the data scientist Keynote

David Selby shares some of the challenges he has faced coercing meaning from data and explains why he is particularly enthusiastic about the latest technological developments in the data science field.

Chris Selland is vice president of business development for HPE Software’s Big Data Platform. In this role, he leads global strategic alliances for the HPE Haven platform, including HPE Vertica, HPE IDOL, and big data/Hadoop partnerships. Chris has more than 20 years of experience in driving innovative go-to-market initiatives and leading strategic alliance and corporate development for technology-enabled, high-growth businesses. Previously, Chris was VP of marketing for HP Vertica, where he led the launches of key initiatives such as the HP Haven platform, the HP Big Data Conference, and the HP Vertica Customer Advisory Board. Prior to HPE, Chris was senior VP of corporate development for Hale Global, a technology holding company focused on acquiring and operating special situations. Earlier in his career, he was VP of business development at SoundBite Communications, VP of CRM and Internet research at the Yankee Group, and founder of Reservoir Partners. He holds a bachelor’s degree in operations research and industrial engineering from Cornell University and an MBA in international business and economics from the New York University Stern School of Business.

Presentations

Empowering the data-driven organization Session

As we strive to realize big data's value, many seek more agile and capable analytic systems that ensure end-to-end security. Chris Selland and Richard Gascoigne explore Hewlett Packard Enterprise's robust yet flexible offering that scales with evolving needs, covering HPE's big data reference architecture, Vertica SQL on Hadoop, and machine learning as a service.

Mark Sellors is head of data engineering for Mango Solutions, where he helps clients run their data science operations in production-class environments. Mark has extensive experience in analytic computing and helping organizations in sectors from government to pharma to telecoms get the most from their data engineering environments.

Presentations

R and reproducible reporting for big data Tutorial

Aimee Gott, Mark Sellors, and Richard Pugh explore techniques for optimizing your workflow in R when working with big data, including how to efficiently extract data from a database, techniques for visualization and analysis, and how all of this can be incorporated into a single, reproducible report, directly from R.

Jayesh Seshadri is currently a technical lead for Cloudera Manager and Cloudera Backup and Disaster Recovery. Previously, Jayesh designed highly scalable cloud management systems at VeloCloud Networks and led several engineering initiatives for vCenter at VMware. Jayesh has a master’s degree in computer sciences from the University of Texas at Austin.

Presentations

Apache Hadoop operations for production systems Tutorial

Jayesh Seshadri, Justin Hancock, Mark Samson, and Wellington Chevreuil offer a full-day deep dive into all phases of successfully managing Hadoop clusters—from installation to configuration management, service monitoring, troubleshooting, and support integration—with an emphasis on production systems.

Ask me anything: Hadoop operations Ask Me Anything

Mark Samson, Jayesh Seshadri, Wellington Chevreuil, and James Kinley, the instructors of the full-day tutorial Apache Hadoop Operations for Production Systems, field a wide range of detailed questions about Hadoop, from debugging and tuning across different layers to tools and subsystems to keep your Hadoop clusters always up, running, and secure.

Nicolas Seyvet is a passionate software developer at Ericsson AB. Nicolas has worked on a wide range of telco-grade (high-availability, scalable, redundant) applications for the telecom/multimedia business and is experienced in Java/JEE (10+ years) and C/C++ (7+ years) and with databases (SQL, NoSQL). He joined Ericsson Research to work on big data, the cloud, and analytics and has built OpenStack and Hadoop/Spark clusters as well as algorithms for real-time data. Nicolas’s particular interests are coding, software engineering, software architecture, distributed and scalable systems, distributed processing, lean/agile methodologies, and the principles of good leadership. He specializes in software design and architecture of complex, high-performance systems, as well as leading high-performing cross-functional teams.

Presentations

Kappa architecture in the telecom industry Session

ICT systems are growing in size and complexity. Monitoring and orchestration mechanisms need to evolve and provide richer capabilities to help handle them. Ignacio Manuel Mulas Viela and Nicolas Seyvet analyze a stream of telemetry/logs in real time by following the Kappa architecture paradigm, using machine-learning algorithms to spot unexpected behaviors from an in-production cloud system.

Rachel Shadoan is the cofounder and CEO of Akashic Labs, a Portland-based research and development consultancy, where she specializes in combining research methodologies to provide rich and accurate answers to technology’s pressing questions. Questions about people are her favorite kinds of questions to answer. Prior to founding Akashic Labs, Rachel worked with Intel exploring both how people use their phones in cars and how the ability to convert to a tablet impacts laptop use. She has also collaborated with Stanford digital humanities scholars and Oxford data archivists to develop a visual graph query language to allow researchers to form queries on complex multidimensional data. Originally from Oklahoma, Rachel holds an MS in computer science from the University of Oklahoma, as well as an MS in design ethnography from the University of Dundee in Scotland. As is thematically appropriate for her adopted home of Portland, she likes cruciferous vegetables (especially kale) and occasionally brews beer.

Presentations

Objectivity is a myth: Your data is not objective, and neither are you Data 101

We often treat data as an impartial representation of reality—an unbiased delivery mechanism for "ground truth." Data collection and analysis systems, however, are designed by people: our knowledge, experience, and beliefs influence the design decisions we make and thus the data we collect. Rachel Shadoan explores how to adapt our processes to account for data's lack of objectivity.

Your TOS is not informed consent: Ethical experimentation for the Web DDBD

Informed consent is the backbone of ethical research involving human participants. In academic contexts, there are systems in place to protect human participants, but similar structures are lacking in the companies that drive the Web. Rachel Shadoan explains why adopting informed consent as an industry standard is vital, both ethically and for the validity of the research we do.

Saqib Shaikh is a software engineer at Microsoft, where he has worked for 10 years. Saqib has developed a variety of Internet-scale services and data pipelines powering Bing, Cortana, Edge, MSN, and various mobile apps. Being blind, Saqib is passionate about accessibility and universal design; he serves as an internal consultant for teams including Windows, Office, Skype, and Visual Studio and has spoken at several international conferences. Saqib has won three Microsoft hackathons in the past year. His current interests focus on the intersection between AI and HCI and the application of technology for social good.

Presentations

Beyond guide dogs: How advances in deep learning can empower the blind community Session

Anirudh Koul and Saqib Shaikh explore cutting-edge advances at the intersection of vision, language, and deep learning that help the blind community "see" the physical world and explain how developers can utilize this state-of-the-art image-captioning and computer-vision technology in their own applications.

Paul Shannon is VP of technology at 7digital, the power behind innovative digital listening experiences, where he has helped grow the team and scale the API platform to support the changing technology and music landscape. He’s been responsible for building a team of data specialists, ensuring the organization, its customers, and suppliers are able to gain insight into the world of music services. Paul joined 7digital as a developer on the API team. Previously, he worked in the legal and mobile industries and was a pivotal member of the team that adopted agile, lean, and XP practices at vehicle finance and tech company Codeweavers Ltd. Paul is a regular conference speaker and tackles topics around process improvement, data-driven decision making, testing, recruitment, and team dynamics at universities and conferences around the world.

Presentations

What’s next for music services? The answer is in the data Session

Can our real-time distributed data systems help predict whether high-resolution audio is the future of digital music? What about content curation? Paul Shannon and Alan Hannaway explore the future of music services through data and explain why 7digital believes well-curated, high-resolution listening experiences are the future of digital music services.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Jonathan Seidman, Ted Malaska, and Gwen Shapira, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Hadoop application architectures: Fraud detection Tutorial

Jonathan Seidman, Mark Grover, Gwen Shapira, and Ted Malaska walk attendees through an end-to-end case study of building a fraud detection system, providing a concrete example of how to architect and implement real-time systems.

Office Hour with Gwen Shapira (Confluent) and Todd Palino (LinkedIn) Office Hours

Join Gwen Shapira, Todd Palino, and other Apache Kafka experts for a fast-paced conversation on Apache Kafka use cases, troubleshooting Apache Kafka, using Kafka in stream architectures, and when to avoid Kafka.

Putting Kafka into overdrive Session

Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Gwen Shapira and Todd Palino explain the right approach for getting the most out of Kafka, exploring how to monitor, optimize, and troubleshoot performance of your data pipelines from producer to consumer and from development to production.

When it absolutely, positively has to be there: Reliability guarantees in Kafka Session

Kafka provides the low latency, high throughput, high availability, and scale that financial services firms require. But can it also provide complete reliability? Gwen Shapira and Jeff Holoman explain how developers and operation teams can work together to build a bulletproof data pipeline with Kafka and pinpoint all the places where data can be lost if you're not careful.

Nigel Shardlow’s career has spanned psychology, artificial intelligence, academic philosophy, web design, product development, entrepreneurship, and consultancy. After leaving academia, Nigel spent the early part of his career in telecoms, leading product development teams at Orange (now EE) and BT. More recently, he worked as a behavior change consultant to organizations in the public and private sector, using hard evidence and science to measurably improve outcomes. Nigel brings insights from behavioral science, marketing science, and psychology into the model development process. His recent research interests include the meaning of loyalty, the nature of explanation, embedded cognition, and the relationship between brain science and ethnography.

Presentations

Doing data science to support strategic business decisions Data 101

Much of the business narrative around data science draws attention to the importance of prediction and predictive models. But in many business contexts, prediction alone is not enough. Thomas French and Nigel Shardlow explain that to support strategic decision making, data scientists must build models that can explain events in a meaningful way.

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions and expertise ranging from development to production deployment in a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. He has held technology leadership positions for NetApp, Fujitsu, and others. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes. He holds two patents.

Presentations

Building a modern data architecture Session

There are many factors to consider when building your data stack, but the architecture could be your biggest challenge. Yet it could also be the best predictor for success. Given the many elements to take into account and the lack of a proven playbook, Ben Sharma explains where you start to assemble your own best practices for building a scalable data architecture.

Office Hour with Ben Sharma (Zaloni) Office Hours

Ben is happy to answer questions about data lake design, development, and management. He can also offer insight into how to leverage data lakes for key aspects of risk data aggregation and reporting.

Risk data aggregation and risk reporting for financial services Session

Risk data aggregation and risk reporting (RDARR) is critical to compliance in financial services. Big data expert Ben Sharma explores multiple use cases to demonstrate how organizations in the financial services industry are building big data lakes that deliver the necessary components for risk data aggregation and risk reporting.

Chang She is a software engineer on Cloudera Navigator creating metadata management tools for Hadoop. Prior to joining Cloudera, Chang was cofounder and CTO of DataPad, a next-gen BI/analytics company. An early core contributor to the pandas library, Chang is passionate about creating data tools that make people more productive. He is a recovering financial quant with bachelor’s and master’s degrees in EECS from MIT.

Presentations

Don't build a data swamp: Hadoop governance case studies for financial services Session

Mark Donsky and Chang She explore canonical case studies that demonstrate how leading banks, healthcare, and pharmaceutical organizations are tackling Hadoop governance challenges head-on. You'll learn how to ensure data doesn't get lost, help users find and trust the data they need, and protect yourself against a data breach—all at Hadoop scale.

Jayant Shekhar is the founder of Sparkflows Inc., which enables machine learning on large datasets using Spark ML and intelligent workflows. Jayant focuses on Spark, streaming, and machine learning and is a contributor to Spark. Previously, Jayant was a principal solutions architect at Cloudera working with companies both large and small in various verticals on big data use cases, architecture, algorithms, and deployments. Prior to Cloudera, Jayant worked at Yahoo, where he was instrumental in building out the large-scale content/listings platform using Hadoop and big data technologies. Jayant also worked at eBay, building out a new shopping platform, K2, using Nutch and Hadoop among others, as well as KLA-Tencor, building software for reticle inspection stations and defect analysis systems. Jayant holds a bachelor’s degree in computer science from IIT Kharagpur and a master’s degree in computer engineering from San Jose State University.

Presentations

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX Tutorial

Jayant Shekhar, Vartika Singh, and Krishna Sankar explore techniques for building machine-learning apps using Spark ML as well as the principles of graph processing with Spark GraphX.

Tomer Shiran is cofounder and CEO of Dremio. Previously, Tomer was the VP of product at MapR, where he was responsible for product strategy, roadmap, and new feature development. As a member of the executive team, he helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He is the author of five US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, the Israel Institute of Technology.

Presentations

Analyzing dynamic JSON with Apache Drill Session

Modern data is often messy and does not fit into the old schema-on-write or even the newer schema-on-read paradigms. Some data effectively has no schema at all. Tomer Shiran explores how to analyze such data with Drill, covering Drill’s internal architecture and explaining how type introspection can be used to query JSON and JSON-structured data—such as data in MongoDB—without requiring a schema.

BI on Hadoop: What are your options? Session

There are (too?) many options for BI on Hadoop. Some are great at exploration, some are great at OLAP, some are fast, and some are flexible. Understanding the options and how they work with Hadoop systems is a key challenge for many organizations. Tomer Shiran provides a survey of the main options, both traditional (Tableau, Qlik, etc.) and new (Platfora, Datameer, etc.).

Emil A. Siemes is a longtime Java veteran interested in building, running, and managing the next generation of data-driven web and mobile applications. After several years with Sun, Aplix, Wily, and SpringSource (VMware), Emil joined Hortonworks, where he helps customers modernize their data architectures with Hadoop.

Presentations

The IoT with Apache NiFi and Hadoop: Better together Session

The Internet of Things and big data analytics are currently two of the hottest topics in IT. But how do you get started using them? Emil Andreas Siemes and Stephan Anné demonstrate how to use Apache NiFi to ingest, transform, and route sensor data into Hadoop and how to do further predictive analytics.

Vartika Singh is a solutions consultant at Cloudera. Previously, Vartika was a data scientist applying machine-learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 10 years of experience designing and developing solutions and frameworks utilizing machine-learning techniques.

Presentations

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX Tutorial

Jayant Shekhar, Vartika Singh, and Krishna Sankar explore techniques for building machine-learning apps using Spark ML as well as the principles of graph processing with Spark GraphX.

Always attracted to solving real-world problems involving complex dynamical systems, Matthew Smith initially trained as an ecologist before undertaking an applied mathematics PhD to strengthen his quantitative skills. He then joined the Computational Science Laboratory at Microsoft Research, Cambridge, where he became renowned for completing extremely difficult predictive analytics research, principally using prototype research software. He now applies those skills to solving real-world data science problems.

Presentations

Data science++: Improving data science by adding domain understanding HDS

Matthew Smith demonstrates how adding domain understanding can yield unexpectedly high predictive accuracy, give domain experts and customers new insights into how a system functions, and produce computationally efficient prediction algorithms, in applications such as predicting crops, global carbon emissions, diseases, ecosystems, species distributions, weather, roads, and riots.

Emily Sommer is a data engineer at Etsy. She brings a wealth of practical knowledge and can-do attitude to both her team and the students she tutors through ScriptEd.

Presentations

The Bag of Little Bootstraps: A/B experimenting with big data made small Session

Bootstrapping is a statistical technique that resamples data many times over—an effective method for determining confidence in A/B test results but an expensive procedure in a world of big data. Emily Sommer explains how Etsy implemented the Bag of Little Bootstraps, a clever take on bootstrapping that involves examining many smaller subsets of one's data.


Presentations

Demonstrating the art of the possible with Spark and Hadoop Session

Apache Spark is on fire. Over the past five years, more and more organizations have looked to leverage Spark to operationalize their teams and the delivery of analytics to their respective businesses. Adrian Houselander and Joy Spohn demonstrate two use cases of how Apache Spark and Apache Hadoop are being used to harness valuable insights from complex data across cloud and hybrid environments.

Alessandra Staglianò is a data scientist who has worked on multiple complex projects. In addition to various machine-learning techniques, Alessandra’s expertise is in extracting relevant information from noisy and redundant data. Her former research work has been published in a variety of journals. Alessandra holds a PhD in computer science specializing in machine learning and machine vision.

Presentations

Detecting anomalies in the real world HDS

Anomaly detection is a hot topic in data and can be applied to various fields. Anomaly detection faces challenges common to all big data projects but also deals with higher uncertainty and more difficult measurements, all while operating in real time. Alessandra Staglianò explains how those challenges translate to the real world and how to overcome them with the latest data science tools.

Rupert Steffner is chief platform architect of Otto Group’s new business intelligence platform, BRAIN. In this role, Rupert is responsible for the entire setup as well as for initiating and managing the major change projects. Previously, he was head of the Marketing Department at the University of Applied Sciences, Salzburg and worked as a business intelligence leader for several European and US companies in a range of industries from ecommerce and retail to finance and telco. Rupert has over 25 years of experience in designing and implementing highly sophisticated technical and business solutions with a focus on customer-centric marketing. He holds an MBA from WU Vienna.

Presentations

Otto’s little army of real-time bots: How online retailers can defend shopping carts and retarget customers in real time DDBD

The latest research shows that responding to customers in real time is critical to success. Otto has developed a whole set of real-time applications to manage customers at interaction time. Rupert Steffner highlights the business metrics and the application architecture and outlines a real-time data management model developers of any interactive business intelligence application can use.

Carl Steinbach is a senior staff software engineer at LinkedIn, where he leads the Hadoop Platform team. Carl is also a member of LinkedIn’s Technology Leadership Group and its Open Source Committee. Before joining LinkedIn, Carl was an early employee at Cloudera. He is an ASF member and former PMC chair of the Apache Hive Project.

Presentations

Scaling out to 10 clusters, 1,000 users, and 10,000 flows: The Dali experience at LinkedIn Session

Carl Steinbach offers an overview of Dali, LinkedIn's collection of libraries, services, and development tools that are united by the common goal of providing a dataset API for Hadoop.

Jamie Stone is vice president, EMEA, at Anomali. Jamie has over 20 years’ experience in enterprise software management and systems. Prior to Anomali, he held positions at ArcSight and Cloudera. Jamie holds an MBA from Warwick University.

Presentations

Apache Hadoop meets cybersecurity Keynote

The cybersecurity landscape is quickly changing, and Apache Hadoop is becoming the analytics and data management platform of choice for cybersecurity practitioners. Tom Reilly explains why organizations are turning toward the open source ecosystem to break down traditional cybersecurity analytics and data constraints in order to detect a new breed of sophisticated attacks.

Brian Suda is a master informatician currently residing in Reykjavík, Iceland. Since first logging on in the mid-’90s, he has spent a good portion of each day connected to the internet. When he is not hacking on microformats or writing about web technologies, he enjoys taking kite aerial photography. His own little patch of internet can be found at Suda.co.uk, where many of his past projects, publications, interviews, and crazy ideas can be found.

Presentations

Introduction to visualizations using D3 Tutorial

Visualizations are a key part of conveying any dataset. Brian Suda explains what good data visualizations are and how you can build them using D3, the most popular, easiest, and most extensible way to get your data online in an interactive way.

Suresh Duddi (DP to his friends and coworkers) is a tech wizard with over 25 years of experience in Silicon Valley. He is currently a VP of product and engineering at Yahoo, where he leads a team that builds analytics for Yahoo.com while managing a petabyte of data. DP is also no stranger to the startup world; he cofounded Habitera, where he focused on creating solutions for health using behavioral economics, and worked for both LiveOps and Simply Hired. He’s worked in engineering leadership positions at pioneering companies including Netscape, where he invented the Internet. :-) When you talk with DP, you’ll recognize his deep passion for technology. In his free time, you’ll find him playing volleyball, discovering local South Indian restaurants to satisfy his cravings, and baking healthy cakes with his daughters.

Presentations

Big SQL: The future of in-cluster analytics and enterprise adoption Session

Hear why big SQL is the future of analytics. Experts at Yahoo, Knewton, FullStack Analytics, and Looker discuss their respective data architectures, the trials and tribulations of running analytics in-cluster, and examples of the real business value gained from putting their data in the hands of employees across their companies.

From writing the first application server for the web to designing the first crowdsourced ecosystem, Lloyd Tabb has spent the last 25 years revolutionizing how the world uses the Internet and, by extension, data. As cofounder and CEO of Looker, Lloyd combines his passion for data exploration and discovery, his love of programming languages, and his commitment to developing and nurturing talent to change the face of the business intelligence market. Originally a database and languages architect at Borland International, Lloyd left to found Commerce Tools (acquired by Netscape in 1995). At Netscape, he became the principal engineer on Netscape Navigator Gold, led several releases of Communicator, and helped define the creation of Mozilla.org. Prior to founding Looker, Lloyd was the CTO of LiveOps, cofounder of Readyforce, and advisor to Luminate, recently acquired by Yahoo.

Presentations

Big SQL: The future of in-cluster analytics and enterprise adoption Session

Hear why big SQL is the future of analytics. Experts at Yahoo, Knewton, FullStack Analytics, and Looker discuss their respective data architectures, the trials and tribulations of running analytics in-cluster, and examples of the real business value gained from putting their data in the hands of employees across their companies.

David Talby is Atigeo’s chief technology officer, working to evolve its big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Semantic natural language understanding with Spark Streaming, UIMA, and machine-learned ontologies Session

David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.

Jaan Tallinn is an Estonian programmer who participated in the development of Skype in 2002 and FastTrack/Kazaa, a file-sharing application, in 2000. Jaan is partner and cofounder of the development company Bluemoon, which created the game SkyRoads. He graduated from the University of Tartu with a BSc in theoretical physics; his thesis involved traveling interstellar distances using warps in space-time. Jaan is a former member of the Estonian President’s Academic Advisory Board, as well as a founder of the Centre for the Study of Existential Risk, the Future of Life Institute, and the personalized medical research company MetaMed.

Presentations

Panel: The future of intelligence Session

Stuart Russell and Jaan Tallinn explore and debate the future of artificial intelligence in a panel discussion moderated by Marc Warner.

Jordan Tigani has more than 15 years of professional software development experience, the last four of which have been spent building BigQuery. Prior to joining Google, Jordan worked at a number of star-crossed startups, where he learned to make data-based predictions. He is a coauthor of Google BigQuery Analytics. When not analyzing soccer matches, he can often be found playing in one.

Presentations

Big data at Google: Solving problems at scale Keynote

Google is no stranger to big data, pioneering several big data technologies grown and tested internally—including MapReduce, BigTable, and most recently Dataflow and TensorFlow, as well as one of the most heavily used tools at Google, BigQuery—and making them available to everyone. Jordan Tigani shares what big data means for Google and announces several new BigQuery features.

Pushing the limits of Google BigQuery Session

Data sizes are getting larger all the time. Querying terabytes just isn't cool anymore; now you need to query petabytes. Jordan Tigani puts BigQuery to the test by performing interactive analytics against a 1 PB dataset, showcasing the exciting new features that make this process easy, fast, and affordable, and demonstrates the simplicity of managing your petabyte-scale data with "NoOps".

Fergal Toomey is a specialist in network data analytics and a founder of Corvil, where he has been intensively involved in developing key product innovations directly applicable to managing IT system performance. Fergal has been involved in the design and development of innovative measurement and analysis algorithms for the past 12 years. Previously, he was an assistant professor at the Dublin Institute for Advanced Studies, where he was a member of the Applied Probability Group, which also included Raymond Russell, Corvil’s CTO. Fergal holds an MSc in physics and a PhD in applied probability theory, both from Trinity College, Dublin.

Presentations

Using Spark and Hadoop in high-speed trading environments Session

Fergal Toomey and Pierre Lacave demonstrate how to effectively use Spark and Hadoop to reliably analyze data in high-speed trading environments across multiple machines in real time.

Deenar Toraskar is a cofounder of Think Reactive, which provides responsive, resilient, elastic, ready-to-go risk analytics solutions based on Spark. Think Reactive has worked with various banks to implement market and credit risk calculation platforms, including the new FRTB calculations (see http://goo.gl/8bJY3Q). The solution features an intuitive interactive notebook interface and is backed by a high-definition warehouse to provide deep insight.

Previously, Deenar worked at a Tier 1 investment bank, leading a team developing risk analytics applications (with numerous passionate and satisfied users) on a Spark/Hadoop platform. Deenar is also an Apache Spark contributor, a keen cyclist, and a proud parent.

Presentations

Simple, fast, and flexible risk aggregation in Hadoop Session

Value at risk (VaR) is a widely used risk measure. VaR is not simply additive, which provides unique challenges to report VaR at any aggregate level, as traditional database aggregation functions don't work. Deenar Toraskar explains how the Hive complex data types and user-defined functions can be used very effectively to provide simple, fast, and flexible VaR aggregation.
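The abstract’s central point, that VaR figures themselves cannot simply be summed, while full P&L scenario vectors can, is easy to see in a few lines. The sketch below is a minimal Python illustration of the idea, not the Hive implementation the session describes; the positions, distributions, and confidence level are invented for demonstration.

```python
import random

def var(pnl_scenarios, confidence=0.99):
    """Historical VaR: the loss at the given percentile of the P&L distribution."""
    ordered = sorted(pnl_scenarios)
    idx = int((1 - confidence) * len(ordered))
    return -ordered[idx]

# Two positions, each with a vector of simulated P&L scenarios.
random.seed(42)
pos_a = [random.gauss(0, 100) for _ in range(10_000)]
pos_b = [random.gauss(0, 100) for _ in range(10_000)]

# Naive aggregation (wrong): individual VaR figures are not additive.
naive = var(pos_a) + var(pos_b)

# Correct aggregation: sum the scenario vectors element-wise first, then
# take the VaR of the combined distribution -- this is what keeping full
# P&L vectors (e.g. as array-typed columns) makes possible at any level.
portfolio = [a + b for a, b in zip(pos_a, pos_b)]
correct = var(portfolio)

print(naive, correct)  # diversification pulls portfolio VaR below the naive sum
```

In a warehouse, the same element-wise-sum-then-percentile step can be expressed once as a user-defined aggregate over the vector column, which is what makes the approach flexible across aggregation levels.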

Hoa Tram is currently a partner solutions architect for OSIsoft. Hoa has worked in the software industry for nearly 20 years and has been with OSIsoft for the last 10. Attached to the company headquarters in San Leandro, CA, Hoa has been the development lead for the embedded Linux version of the PI Historian, the high-speed core communications layer of the PI system, and the OEM initiatives of OSI’s strategic partners. Hoa is an expert in the areas of data acquisition from the edge and analysis of that data in private and public cloud environments, working directly with OSI’s major equipment manufacturing and analytics partners to provide solutions to OSI’s customers in the industrial space ranging from increasing overall equipment effectiveness to cybersecurity.

Presentations

Industrial big data and sensor time series data: Different but not difficult Session

For decades, industrial manufacturing has dealt with large volumes of sensor data and handled a variety of data from the various manufacturing operations management (MOM) systems in production, quality, maintenance, and inventory. Gopal GopalKrishnan and Hoa Tram offer lessons learned from applying big data ecosystem tools to oil and gas, energy, utilities, metals, and mining use cases.

Marton Trencseni is a data engineer at Facebook in London. An engineer and physicist by training, Marton worked in various engineering roles before launching his own distributed database startup. He then joined Prezi, where he spent three years as director of data analytics, building out Prezi’s data platform and data team. Marton regularly shares stories and practical knowledge with the data community at conferences.

Presentations

Beautiful A/B testing Session

At first glance A/B testing is a simple matter: take a few numbers, put them into an online calculator, and read off the statistical significance. But in fact it's a complex topic with amazing opportunities (and pitfalls) for organizations. Marton Trencseni offers a deep dive into A/B testing to provide attendees the information needed to improve their organizations' experimentation cultures.
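The “online calculator” step the abstract alludes to usually amounts to a two-proportion z-test. As a rough sketch of what that calculation does under the hood (with invented conversion counts, and nothing beyond the Python standard library):

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value) under the pooled-variance approximation.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via the error function; p-value is the two-sided tail mass.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 2.0% vs 2.6% conversion on 10,000 users per arm.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The pitfalls the session covers (peeking, multiple comparisons, underpowered tests) are precisely the ways this simple calculation gets misused in practice.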


Presentations

Insight at the speed of thought: Visualizing and exploring data at scale Session

Nick Turner offers an insightful view on how technology is delivering self-service analytics through visualization and enabling business users to quickly explore their data at scale.

Kostas Tzoumas is a PMC member of the Apache Flink project and cofounder of data Artisans, the company founded by the original development team that created Flink. Kostas has spoken extensively about Flink, including at Hadoop Summit San Jose 2015.

Presentations

Enabling new streaming applications with Apache Flink Session

Data stream processing is emerging as a new paradigm for the data infrastructure. Streaming promises to unify and simplify many existing applications while simultaneously enabling new applications on both real-time and historical data. Stephan Ewen and Kostas Tzoumas introduce the data streaming paradigm and show how to build a set of simple but representative applications using Apache Flink.

Mona Vernon is vice president of Thomson Reuters Labs, which partners with customers and third parties, such as startups and academics, on new data-driven innovations. Previously at Thomson Reuters, Mona ran the Emerging Technology group and launched an Open Innovation Challenge program across the enterprise. Prior to joining Thomson Reuters, she held product development and management roles in technology startups. Mona is an executive board member of the FinTech Sandbox in Boston, an advisory board member of the Commonwealth of Massachusetts Big Data Advisory Committee, and winner of the Boston 50 on Fire. Mona holds a BS and an MS in mechanical engineering from Tufts University and an SM in engineering and management from MIT, where her research focused on the role of customer experience in digital business strategy.

Presentations

Data wants to be shareable Keynote

Data has more potential value when it can be shared. In order to monetize data, it must first be made shareable: shareable data is an asset that can be sold, traded, or used to create new data marketplaces. Mona Vernon outlines a framework to structure thinking about data shareability and monetization and explores these new business opportunities.

Every business is a data business DDBD

How many ways can data be monetized? Data fuels better financial outcomes for a firm. Mona Vernon explains how data-driven decision making drives better customer-experience design and more efficient operations and why data is also an asset that can be sold, traded, or used to create new marketplaces.

Tim is a senior principal engineer at Intel, where he has worked in IT engineering for 14 years. He has over 24 years of network engineering experience, focused mainly on WAN, internet, telephony, video, and network security, where over the years he has been instrumental in architecture, design, and operations. Tim has been active on various technical advisory bodies outside Intel for many years.

Presentations

Apache Hadoop meets cybersecurity Keynote

The cybersecurity landscape is quickly changing, and Apache Hadoop is becoming the analytics and data management platform of choice for cybersecurity practitioners. Tom Reilly explains why organizations are turning toward the open source ecosystem to break down traditional cybersecurity analytics and data constraints in order to detect a new breed of sophisticated attacks.

Kai Voigt is a senior instructor for Hadoop classes at Cloudera, delivering training classes for developers and administrators worldwide. Kai held the same role at MySQL, Sun, and Oracle. He has spoken at a number of O’Reilly conferences.

Presentations

Data science at scale: Using Spark and Hadoop Training

Learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Through in-class simulations and exercises, Kai Voigt walks attendees through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.

Data science at scale: Using Spark and Hadoop (Day 2) Training Day 2

Learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Through in-class simulations and exercises, Kai Voigt walks attendees through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.

Introduction to Apache Spark for Java and Scala developers Session

Ted Malaska leads an introduction to basic Spark concepts such as DAGs, RDDs, transformations, actions, and executors, designed for Java and Scala developers. You'll learn how your mindset must evolve beyond Java or Scala code that runs in a single JVM as you explore JVM locality, memory utilization, network/CPU usage, optimization of DAGs pipelines, and serialization conservation.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a streaming data platform built on the Lightbend Reactive Platform, Kafka, Spark, Flink, and Mesosphere DC/OS. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He contributes to several open source projects and co-organizes conferences around the world as well as user groups in Chicago.

Presentations

Office Hour with Dean Wampler (Lightbend) Office Hours

Dean will discuss all things Spark, stream processing, and deployment platforms, such as Mesos and Hadoop.

Scala: The unpredicted lingua franca for data science Session

Andy Petrella and Dean Wampler explore what it means to do data science today and why Scala succeeds at coping with large and fast data where older languages fail. Andy and Dean then discuss the current ongoing projects in advanced data science that use Scala as the main language, including Splash, mic-cut problem, OptiML, needle (DL), ADAM, and more.

With more than 15 years’ experience working with designers, engineers, and scientists, Tricia Wang has a particular interest in designing human-centered systems. Tricia advises organizations on integrating big data and what she calls “thick data”—data brought to light using digital-age ethnographic research methods that uncover emotions, stories, and meaning—to improve strategy, policy, products, and services. Organizations she has worked with include P&G, Nokia, GE, Kickstarter, the United Nations, and NASA. Tricia recently finished an expert-in-residency at IDEO, where she extended and amplified IDEO’s impact in design research. When not working with organizations, she spends the other half of her life researching online anonymity and the bias towards the quantifiable. Recognized as a leading authority on applied research, human-centered design, social media, and Chinese Internet culture, Tricia’s work and points of view have been featured in Slate, the Atlantic, Al Jazeera, Fast Company, Makeshift, and Wired. A sought-after speaker, she has given talks at conferences such as Lift, Strata, IxDA, Webstock, and South by Southwest. She has also spoken at Wrigley, P&G, Nike, 21st Century Fox, Tumblr, and various investment firms.

Tricia began her career as a documentary filmmaker, an HIV/AIDS activist, a hip-hop education advocate, and a technology educator in low-income communities. She has worked across four continents; her life philosophy is that you have to go to the edge to discover what’s really happening. She’s the proud companion of dog #ellethedog. She also oversees Ethnography Matters, a site that publishes articles about applied ethnography and technology to a public audience. Tricia has a BA in communication and PhD in sociology from UC San Diego. She holds affiliate positions at Data and Society, Harvard University’s Berkman Center for Internet & Society, and New York University’s Interactive Telecommunications Program (ITP). She is also a Fulbright Fellow and National Science Foundation Fellow, where she is the first Western scholar to work with the China Internet Network Information Center (CNNIC) in Beijing, China.

Presentations

Prophecies and predictive models: How a 3D approach to data transforms your business Keynote

The famous Oracle at Delphi had a secret: Its prophecies were interpreted by Temple Guides, using a very early version of ethnographic research. With today’s near-blind faith in the predictive power of Big Data, it’s time to take a lesson from the Ancient Greeks.

Simon Wardley is a researcher for the Leading Edge Forum focused on the intersection of IT strategy and new technologies. Simon is a seasoned executive who has spent the last 15 years defining future IT strategies for companies in the FMCG, retail, and IT industries—from Canon’s early leadership in the cloud-computing space in 2005 to Ubuntu’s recent dominance as the top cloud operating system. As a geneticist with a love of mathematics and a fascination for economics, Simon has always found himself dealing with complex systems, whether in behavioral patterns, the environmental risks of chemical pollution, developing novel computer systems, or managing companies. He is a passionate advocate and researcher in the fields of open source, commoditization, innovation, organizational structure, and cybernetics.

Simon’s most recent published research, “Clash of the Titans: Can China Dethrone Silicon Valley?,” assesses the high-tech challenge from China and what this means to the future of global technology industry competition. His previous research covers topics including the nature of technological and business change over the next 20 years, value chain mapping, strategies for an increasingly open economy, Web 2.0, and a lifecycle approach to cloud computing. Simon is a regular presenter at conferences worldwide and has been voted one of the UK’s top 50 most influential people in IT in Computer Weekly’s 2011 and 2012 polls.

Presentations

Situational awareness: On the importance of mapping DDBD

Simon Wardley examines the level of situational awareness within business, why it matters, and whether we can anticipate and exploit change before it hits us. Is it simply a lack of data, or are we not looking at the right things?

Marc Warner is the CEO of ASI. Previously, Marc held a research fellowship in physics at Harvard University, where he studied quantum metrology and quantum computing. His PhD research, in the field of quantum computing, was published in Nature and covered in the New York Times. Marc has worked in finance and consulted in numerous roles for a range of clients, including the UK Houses of Parliament, the NHS, the BBC, and various startups on talent selection and data-driven decision making.

Presentations

AI for business: A hands-on introduction to what machine learning can do Tutorial

In a hands-on tutorial designed for executives, product managers, and business leaders, Marc Warner explores what's possible (and not) with machine learning and what that means for businesses. Attendees will gain experience with cutting-edge artificial intelligence by building their very own handwriting recognition engine. No technical background required.

Panel: The future of intelligence Session

Stuart Russell and Jaan Tallinn explore and debate the future of artificial intelligence in a panel discussion moderated by Marc Warner.

Melanie Warrick is a senior developer advocate at Google with a passion for machine learning problems at scale. Melanie’s previous experience includes work as a founding engineer on Deeplearning4j and as a data scientist and engineer at Change.org.

Presentations

Deep learning and natural language processing with Spark Session

Deep learning is taking data science by storm, due to the combination of stable distributed computing technologies, increasing amounts of data, and available computing resources. Andy Petrella and Melanie Warrick show how to implement a Spark-ready version of the long short-term memory (LSTM) neural network, widely used in the hardest natural language processing and understanding problems.

What is AI? Data 101

What is AI really? Is it simply a technology that mimics human intelligence or something more? Are the robots coming to destroy us, save us, or both? Melanie Warrick explores the definition of artificial intelligence and seeks to clarify what AI will mean for our world.

Felix Werkmeister is one of the lead developers within Continental’s eHorizon project, where he is mainly responsible for Spark, Hadoop, and Storm development. Before joining Continental, Felix worked as a developer and a member of the IoT Competence Team at Opitz-Consulting GmbH.

Presentations

Year 2025: Big data as enabler of fully automated vehicles Session

Experience tells us a decision is only as good as the information it is based on. The same is true for driving. The better a vehicle knows its surroundings, the better it can support the driver. Information makes vehicles safer, more efficient, and more comfortable. Thomas Beer and Felix Werkmeister explain how Continental exploits big data technologies for building information-driven vehicles.

Tom White is one of the foremost experts on Hadoop. Tom is a data scientist at Cloudera, where he has worked since its foundation on the core distributions from Cloudera and Apache. Previously, he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has been an Apache Hadoop committer since February 2007 and is a member of the Apache Software Foundation. His book Hadoop: The Definitive Guide (O’Reilly) is recognized as the leading reference on the subject. He has written numerous articles for O’Reilly, Java.net, and IBM’s developerWorks and has spoken at several conferences including ApacheCon, OSCON, and Strata + Hadoop World. Tom has a bachelor’s degree in mathematics from the University of Cambridge and a master’s in philosophy of science from the University of Leeds, UK.

Presentations

Petascale genomics Session

The advent of next-generation DNA sequencing technologies is revolutionizing life sciences research by routinely generating extremely large datasets. Tom White explains how big data tools developed to handle large-scale Internet data (like Hadoop) help scientists effectively manage this new scale of data and also enable addressing a host of questions that were previously out of reach.

The next 10 years of Apache Hadoop Session

Ben Lorica hosts a conversation with Hadoop cofounder Doug Cutting and Tom White, an early user and committer of Apache Hadoop.

Thomas Wiecki is the lead data science researcher at Quantopian, where he uses probabilistic programming and machine learning to help build the world’s first crowdsourced hedge fund. Among other open source projects, he is involved in the development of PyMC—a probabilistic programming framework written in Python. A recognized international speaker, Thomas has given talks at various conferences and meetups across the US, Europe, and Asia. He holds a PhD from Brown University.

Presentations

Predicting out-of-sample performance of a large cohort of trading algorithms with machine learning Session

Thomas Wiecki explores the prevalence of backtest overfitting and debunks several common myths in quantitative finance based on empirical findings. Thomas demonstrates how he trained a machine-learning classifier on Quantopian's huge and unique dataset of over 800,000 trading algorithms to predict if an algorithm is overfit and how its future performance will likely unfold.

Martin Willcox leads Teradata’s International Big Data CoE, a team of data scientists and technology and architecture consultants charged with helping customers realize value from their analytic data assets and articulating Teradata’s big data strategy, as well as the nature, value, and differentiation of Teradata’s technology and solution offerings, to prospective customers, analysts, and media organizations across the international region. Martin has 19 years of experience in the IT industry and has worked for five organizations, including two major grocery retailers. Since joining Teradata, Martin has worked in solution architecture, enterprise architecture, demand generation, technology marketing, and management roles. As a former Teradata customer—Martin was the data warehouse manager at Co-operative Retail (UK) and later the senior data architect at Co-operative Group—Martin understands the analytics landscape and marketplace from the twin perspectives of an end-user organization and a technology vendor.

Martin is an infrequent contributor to the Teradata International and Forbes Teradata Voice blogs. He holds a BSc (with honors) in physics and astronomy from the University of Sheffield and a postgraduate certificate in computing for commerce and industry from the Open University. He is married with three children and is a lapsed supporter of Sheffield Wednesday Football Club. In his spare time, Martin enjoys playing with technologies like Teradata Aster, Python, and R, flying gliders, listening to guitar music, and watching his sons play rugby and rock climb.

Presentations

The Internet of Things: It’s the (sensor) data, stupid Keynote

Martin Willcox shares the lessons he's learned from successful Teradata IoT projects about how to manage and leverage sensor data and explains why data management, data integration, and multigenre analytics are foundational to driving business value from IoT initiatives.

Pete Williams is one of the UK’s top data leaders and influencers. Pete is a passionate advocate of data-driven thinking and a thought leader on the use of analytics to disrupt organizations. In his role as head of enterprise analytics at Marks and Spencer, Pete’s remit includes building an analytic community to empower M&S’s data-driven future.

Presentations

Building a unicorn: Creating a data-driven culture at Marks and Spencer DDBD

Starting a big data journey by bagging a unicorn and corralling it in your newly acquired big data stable will not necessarily lead to success or lasting change. So how do you drive value from big data? Drawing on first-hand experience at Marks and Spencer, Pete Williams shares practical examples and advice on how to take your data culture and capability from walk through trot to gallop.

Gary Willis is a data scientist at ASI with a diverse background in applying machine-learning techniques to commercial data science problems. Gary holds a PhD in statistical physics; his research looked at Markov Chain Monte Carlo simulations of complex systems.

Presentations

Removing human bias from the interview process Session

Applying a data-driven approach to the recruitment process has long been an aspirational goal for many organizations. In recent years, through the use of data science, it has become a genuine reality. Gary Willis explains how data science and, more importantly, an intelligent approach to interview design have enabled companies to start identifying unconscious bias in their recruitment process.

Ian Wrigley is the director of education services at Confluent, where he heads the team building and delivering courses focused on Apache Kafka and its ecosystem. Over his 25-year career, Ian has taught tens of thousands of students in subjects ranging from C programming to Hadoop development and administration.

Presentations

A hands-on introduction to Apache Kafka Tutorial

Ian Wrigley leads a hands-on workshop on leveraging the capabilities of Apache Kafka to collect, manage, and process stream data for both big data projects and general-purpose enterprise data integration, covering key architectural concepts, developer APIs, use cases, and how to write applications that publish data to, and subscribe to data from, Kafka. No prior knowledge of Kafka is required.

Ask me anything: Apache Kafka Ask Me Anything

Ian Wrigley, Neha Narkhede, and Flavio Junqueira field a wide range of detailed questions about Apache Kafka. Even if you don’t have a specific question, join in to hear what others are asking.

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Previously, Jennifer worked as a product line manager at VMware, working on the vSphere and Photon system management platforms.

Presentations

Being successful with Apache Hadoop in the cloud Session

Jennifer Wu outlines concepts for successfully running Hadoop in the cloud, provides guidance on selecting cloud storage, covers real-world examples of Hadoop deployment patterns in public clouds, and demos Cloudera Director provisioning on AWS.

Itay Yogev is Intel's IT director for big data analytics. Itay founded Intel's advanced analytics competency center in 2009 and has managed it ever since. He is also responsible for the corporate advanced analytics strategy and its execution and has over 15 years of experience in analytics leadership positions.

Presentations

How to build a big data analytics competency center Session

Big data analytics brings value to enterprises, helping them achieve operational excellence. The big question is how you implement it. Drawing on firsthand experience, Assaf Araki and Itay Yogev share how Intel built a big data analytics competency center, exploring the key elements that help Intel grow its people and capabilities and the challenges and lessons learned.

Edward Zhang is the core developer and architect of Apache Eagle. Edward has spent several years developing monitoring applications for big data systems at eBay and has deep expertise in distributed systems.

Presentations

Apache Eagle: Secure Hadoop in real time Session

Apache Eagle is an open source monitoring solution to instantly identify access to sensitive data, recognize malicious activities, and take action. Arun Karthick Manoharan, Edward Zhang, and Chaitali Gupta explain how Eagle helps secure a Hadoop cluster using policy-based and machine-learning user-profile-based detection and alerting.