Speaker slides: Data science + business analytics training: Strata Data Conference

Rob Thomas (IBM), Tim O'Reilly (O'Reilly Media)

AI has the potential to add $16 trillion to the global economy by 2030, but adoption has been slow. While we understand the power of AI, many of us aren’t sure how to fully unleash its potential. Join Robert Thomas and Tim O'Reilly to learn that AI isn't magic. It’s hard work.

An introduction to machine learning on graphs

David Mack (Octavian)

View slides

Graphs are a powerful way to represent knowledge. Organizations, in fields such as biosciences and finance, are starting to amass large knowledge graphs, but they lack the machine learning tools to extract insights from them. David Mack offers an overview of what insights are possible and surveys the most popular approaches.

Apache Hadoop 3.x state of the union and upgrade guidance

Wangda Tan (Cloudera), Wei-Chiu Chuang (Cloudera)

Download slides (PDF)

Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features like erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, and data node disk balancing. They also walk you through upgrade guidance from 2.x to 3.x.

Architecting a data analytics service both in the public cloud and in the on-premise private cloud: ETL, BI, and machine learning (sponsored by SK Holdings)

Jungwook Seo (SK Holdings)

Download slides (PDF)

Jungwook Seo walks you through AccuInsight+, a cloud data analytics platform that SK Holdings announced in January 2019. It offers eight data analytics services on CloudZ, one of the biggest cloud service providers in Korea.

Architecting a data platform for enterprise use

Mark Madsen (Teradata), Todd Walter (Archimedata)

Download slides (PDF)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Assumed risk versus actual risk: The new world of behavior-based risk modeling

Viridiana Lourdes (Ayasdi)

Download slides (PDF)

Viridiana Lourdes explains how banks and financial enterprises can adopt and integrate actual risk models with existing systems to enhance the performance and operational efficiency of the financial crimes organization. Join in to learn how actual risk models can reduce segmentation noise, utilize unlabeled transactional data, and spot unusual behavior more effectively.

Automating ML model training and deployments via metadata-driven data, infrastructure, feature engineering, and model management

Mumin Ransom (Comcast), Nick Pinckernell (Comcast)

Download slides (1-PDF)

Download slides (2-ZIP)

Mumin Ransom gives an overview of the data management and privacy challenges around automating ML model (re)deployments and stream-based inferencing at scale.

Building a multitenant data processing and model inferencing platform with Kafka Streams

Navinder Pal Singh Brar (Walmart Labs)

Download slides (PPTX)

Each week 275 million people shop at Walmart, generating interaction and transaction data. Navinder Pal Singh Brar explains how the customer backbone team enables extraction, transformation, and storage of customer data to be served to other teams. At 5 billion events per day, the Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer.

Building an AI platform: Key principles and lessons learned

Moty Fania (Intel)

Download slides (PDF)

Moty Fania details Intel’s IT experience of implementing a sales AI platform. This platform is based on a streaming, microservices architecture with a message bus backbone. It was designed for real-time data extraction and reasoning, processes millions of website pages, and can sift through millions of tweets per day.

Cisco Data Intelligence Platform (sponsored by Cisco)

Siva Sivakumar (Cisco)

Watch the keynote

Siva Sivakumar explains the Cisco Data Intelligence Platform (CDIP), which is a cloud-scale architecture that brings together big data, AI and compute farm, and storage tiers to work together as a single entity, while also being able to scale independently to address the IT issues in the modern data center.

Clean the swamp: Gain greater visibility, speed, and governance with data ops (sponsored by Hitachi Vantara)

Chuck Yarbrough (Hitachi Vantara)

Download slides (PDF)

According to Gartner, over 80% of data lake projects were deemed inefficient. Data lakes come and go. Swamps happen. Data agility is fleeting. Chuck Yarbrough walks you through how data ops practices and a modern data architecture bring greater visibility and allow faster data access with proper governance.

Data science and the business of Major League Baseball

Aaron Owen (Major League Baseball), Matthew Horton (Major League Baseball), Josh Hamilton (Major League Baseball)

Download slides (PDF)

Using SAS, Python, and AWS SageMaker, Major League Baseball's (MLB's) data science team outlines how it predicts ticket purchasers’ likelihood to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.

Data security and privacy anti-patterns

Steven Touw (Immuta)

Download slides (PDF)

Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past four years, data security and privacy anti-patterns have emerged across hundreds of customers and industry verticals—there's been an obvious trend. Steven Touw details five anti-patterns and, more importantly, the solutions for them.

Data sonification: Making music from the yield curve

Alan Smith (Financial Times)

Watch the keynote

Based on a critical evaluation of the iconic yield curve chart, Alan Smith argues that combining visualization (data to pixels) with sonification (data to pitch) offers potential to improve not only aesthetic multimedia experiences but also an opportunity to take the presentation of data into the rapidly expanding universe of screenless devices and products.

Deep learning from scratch

Bruno Gonçalves (Data For Science)

View slides

You'll go hands-on to learn the theoretical foundations and principal ideas underlying deep learning and neural networks. Bruno Gonçalves provides the code structure of the implementations that closely resembles the way Keras is structured, so that by the end of the course, you'll be prepared to dive deeper into the deep learning applications of your choice.

Deep learning methods for natural language processing

Garrett Hoffman (StockTwits)

View slides

Download slides (2-PDF)

Garrett Hoffman walks you through deep learning methods for natural language processing and natural language understanding tasks, using a live example in Python and TensorFlow with StockTwits data. Methods include Word2Vec, recurrent neural networks (RNNs) and variants (long short-term memory [LSTM] and gated recurrent unit [GRU]), and convolutional neural networks.

Deep learning on Apache Spark at CERN’s Large Hadron Collider with Analytics Zoo

Sajan Govindan (Intel)

Download slides (PDF)

Sajan Govindan outlines CERN’s research on deep learning in high energy physics experiments as an alternative to customized rule-based methods with an example of topology classification to improve real-time event selection at the Large Hadron Collider. CERN uses deep learning pipelines on Apache Spark using BigDL and Analytics Zoo open source software on Intel Xeon-based clusters.

Delivering the enterprise data cloud

Arun Murthy (Cloudera)

Watch the keynote

In this keynote, we’ll introduce you to the new 100% open source Cloudera Data Platform (CDP), the world’s first enterprise data cloud. CDP is hybrid and multi-cloud, delivering the speed, agility, and scale you need to secure and govern your data anywhere from the edge to AI.

Democratization of data science: Using machine learning to build credit risk models

Moto Tohda (Tokyo Century (USA))

Download slides (PDF)

Tokyo Century was ready for a change. Credit risk decisions were taking too long and the home office was taking notice. The company needed a full stack data solution to increase the speed of loan authorizations, and it needed it quickly. Moto Tohda explains how Tokyo Century put data at the center of its credit risk decision making and removed institutional knowledge from the process.

Downscaling: The Achilles heel of autoscaling Spark clusters

Prakhar Jain (Microsoft), Sourabh Goyal (Qubole)

Download slides (PDF)

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, and so the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design for efficient downscaling, which helps achieve better resource utilization and lower TCO.

Everything is connected and the clock is ticking: AI and big ag data for food security

Sara Menker (Gro Intelligence), Nemo Semret (Gro Intelligence)

Watch the keynote

Sara Menker, CEO, Gro Intelligence

Executive Briefing: Big data in the era of heavy worldwide privacy regulations

Mark Donsky (Okera)

Download slides (PDF)

California is following the EU's GDPR with the California Consumer Privacy Act (CCPA) in 2020. There are penalties for noncompliance, but many companies aren't prepared for this strict regulation. This session will explore the capabilities your data environment needs in order to simplify compliance with CCPA, GDPR, and other regulations.

Executive Briefing: Data catalogs—Concepts, capabilities, and key platforms

Andrew Brust (Blue Badge Insights | ZDNet)

Download slides (PPTX)

Andrew Brust provides a primer on data catalogs and a review of the major vendors and platforms in the market. He examines the use of data catalogs with classic and newer data repositories, including data warehouses, data lakes, cloud object storage, and even software and applications. You'll learn about AI's role in the data catalog world and get an analysis of data catalog futures.

Executive Briefing: Understanding the cult of prediction

Farrah Bostic (The Difference Engine)

View slides

We're living in a culture obsessed with predictions. In politics and business, we collect data in service of the obsession. But our need for certainty and control leads some organizations to be duped by unproven technology or pseudoscience—often with unforeseen societal consequences. Farrah Bostic looks at historical—and sometimes funny—examples of sacrificing understanding for "data."

Executive Briefing: What it takes to use machine learning in fast data pipelines

Dean Wampler (Anyscale)

Download slides (PDF)

Dean Wampler dives into how (and why) to integrate ML into production streaming data pipelines and to serve results quickly; how to bridge data science and production environments with different tools, techniques, and requirements; how to build reliable and scalable long-running services; and how to update ML models without downtime.

Fair, privacy-preserving, and secure ML

Mikio Braun (Zalando)

Download slides (PDF)

With ML becoming more mainstream, the side effects of machine learning and AI on our lives become more visible. You have to take extra measures to make machine learning models fair and unbiased. And awareness for preserving the privacy in ML models is rapidly growing. Mikio Braun explores techniques and concepts around fairness, privacy, and security when it comes to machine learning models.

Foundations for successful data projects

Ted Malaska (Capital One), Jonathan Seidman (Cloudera), Matthew Schumpert (Cloudera, Inc.), Raman Rajasekhar (Cloudera Inc), Krishna Maheshwari (Cloudera)

Download slides (PDF)

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

From isolated to connected: The metamorphosis of Revibe

Gwen Campbell (Revibe Technologies)

Download slides (PPTX)

It’s no surprise that Revibe needed to learn how to evolve to satisfy the current data-hungry market. It launched its first hardware-only device in 2015 and quickly learned that to stay alive, the company needed to get its hands into data. Gwen Campbell discusses Revibe's metamorphosis from a hardware company to a data company and shares lessons learned along the way.

From whiteboard to production: A demand forecasting system for an online grocery shop

Robert Pesch (inovex), Robin Senge (inovex)

Download slides (PDF)

Data-driven software is revolutionizing the world and enabling intelligent services we interact with daily. Robert Pesch and Robin Senge outline the development process, statistical modeling, data-driven decision making, and components needed for productionizing a fully automated and highly scalable demand forecasting system for an online grocery shop for a billion-dollar retail group in Europe.

Gaining new insight into online customer behavior using AI

Moise Convolbo (Rakuten)

Download slides (PDF)

Customer satisfaction is a key success factor for any business. Moise Convolbo highlights the process to capture relevant customer behavioral data, cluster the user journey by different patterns, and draw conclusions for data-informed business decisions.

Handling data gaps in time series using imputation

Alfred Whitehead (Klick), Clare Jeon (Klick)

Download slides (PDF)

Time series forecasts depend on sensors or measurements made in the real, messy world. The sensors flake out, get turned off, disconnect, and otherwise conspire to cause missing signals, the very signals that may tell you what tomorrow's temperature will be or what your blood glucose levels are before bed. Alfred Whitehead and Clare Jeon explore methods for handling data gaps and when to consider which.

Hands-on machine learning with Kafka-based streaming pipelines

Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)

Download slides (PDF)

Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more.

How disruptive tech is reshaping the financial services industry

Swatee Singh (American Express)

Watch the keynote

The financial services industry is increasingly using disruptive technology—including AI and machine learning, edge computing, blockchain, mobile and mixed reality, virtual assistants, and quantum computing to name a few—to enhance the customer experience and personalize their interactions with customers. Swatee Singh outlines how the same is true at American Express.

How machine learning meets optimization

Jari Koister (FICO)

View slides

Machine learning and constraint-based optimization are both used to solve critical business problems. They come from distinct research communities and have traditionally been treated separately. But Jari Koister examines how they're similar, how they're different, and how they can be used to solve complex problems with amazing results.

How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE (BlueData))

Anant Chintamaneni (HPE (BlueData)), Matt Maccaux (HPE (BlueData))

Download slides (PPTX)

Anant Chintamaneni and Matt Maccaux explore whether the combination of containers with large-scale distributed data analytics and machine learning applications is like combining oil and water, or like combining peanut butter and chocolate.

Improve your data science ROI with a portfolio and risk management lens

Brian Dalessandro (Capital One)

Download slides (PDF)

While data science value is well recognized within tech, experience across industries shows that the ability to realize and measure business impact is not universal. A core issue is that data science programs face unique risks many leaders aren’t trained to hedge against. Brian Dalessandro addresses these risks and advocates for new ways to think about and manage data science programs.

Improving Spark by taking advantage of disaggregated architecture

Chenzhao Guo (Intel), Carson Wang (Intel)

Download slides (FILE)

Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today’s data centers. Chenzhao Guo and Carson Wang outline the implementation of a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends, making life easier for customers.

Interactive sports analytics

Patrick Lucey (Stats Perform)

Watch the keynote

Imagine watching sports and being able to immediately find all plays that are similar to what just happened. Better still, imagine being able to draw a play with the Xs and Os on an interface, like a coach draws on a chalkboard, and instantaneously finding all the similar plays and conducting analytics on those plays. Join Patrick Lucey to see how this is possible.

Learning with limited labeled data

Shioulin Sam (Cloudera Fast Forward Labs)

Download slides (PDF)

Supervised machine learning requires large labeled datasets, a prohibitive limitation in many real-world applications. But this could be avoided if machines could learn with a few labeled examples. Shioulin Sam explores and demonstrates an algorithmic solution that relies on collaboration between human and machine to label smartly, and she outlines product possibilities.

Machine learning in data quality management

Jennifer Yang (Wells Fargo ECS)

Download slides (1-PPTX)

Download slides (2-PPTX)

Jennifer Yang discusses a use case that demonstrates how to use machine learning techniques in the data quality management space in the financial industry. You'll discover the results of applying various machine learning techniques in the four most commonly defined data validation categories and learn approaches to operationalize the machine learning data quality management solution.

Managing your Kafka in an explosive growth environment

Alon Gavra (AppsFlyer)

View slides

Frequently, Kafka is just a piece of the production stack that no one wants to touch because it just works. Alon Gavra outlines how Kafka sits at the core of AppsFlyer's infrastructure, which processes billions of events daily.

Migrating millions of users from voice- and email-based customer support to a chatbot

Madhu Gopinathan (MakeMyTrip), Sanjay Mohan (MakeMyTrip)

Download slides (PDF)

At MakeMyTrip, customers were using voice or email to contact agents for postsale support. To improve both agent efficiency and customer experience, MakeMyTrip developed a chatbot, Myra, using some of the latest advances in deep learning. Madhu Gopinathan and Sanjay Mohan explain the high-level architecture and the business impact Myra created.

Natural language understanding at scale with Spark NLP

David Talby (Pacific AI), Alex Thomas (John Snow Labs), Saif Addin Ellafi (John Snow Labs), Claudiu Branzan (Accenture)

View slides

David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Now you see me; now you compute: Building event-driven architectures with Apache Kafka

Michael Noll (Confluent)

View slides

Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer.

Online machine learning in streaming applications

Stavros Kontopoulos (Lightbend), Debasish Ghosh (Lightbend)

Download slides (ZIP)

Stavros Kontopoulos and Debasish Ghosh explore online machine learning algorithm choices for streaming applications, especially those with resource-constrained use cases like IoT and personalization. They dive into Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms from implementation to production deployment, describing the pros and cons of each of them.

Orchestrating data workflows using a fully serverless architecture

Tomer Levi (Fundbox)

Download slides (PDF)

Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. Tomer Levi walks you through how the data engineering team at Fundbox uses AWS serverless technologies to address this problem and how it enables data scientists, BI devs, and engineers to move faster.

Parquet modular encryption: Confidentiality and integrity of sensitive column data

Gidon Gershinsky (IBM)

Download slides (PDF)

The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases.

Practical feature engineering

Ted Dunning (MapR, now part of HPE)

Download slides (PDF)

Feature engineering is generally the section that gets left out of machine learning books, but it's also the most critical part in practice. Ted Dunning explores techniques, a few well known, but some rarely spoken of outside the institutional knowledge of top teams, including how to handle categorical inputs, natural language, transactions, and more in the context of machine learning.

Protecting the healthcare enterprise from PHI breaches using streaming and NLP

Jeff Zemerick (Mountain Fog)

Download slides (PPTX)

Hospitals small and large are adopting cloud technologies, and many are in hybrid environments. These distributed environments pose challenges, none of which are more critical than the protection of protected health information (PHI). Jeff Zemerick explores how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment.

Recent trends in data and machine learning technologies

Ben Lorica (O'Reilly)

Watch the keynote

Ben Lorica dives into emerging technologies for building data infrastructures and machine learning platforms.

Running AI workloads in containers (sponsored by BMC Software)

See-Kit Lam (Malwarebytes), Darren Chinen (Malwarebytes)

Download slides (PDF)

Developing, deploying and managing AI and anomaly detection models is tough business. See-Kit Lam details how Malwarebytes has leveraged containerization, scheduling, and orchestration to build a behavioral detection platform and a pipeline to bring models from concept to production.

Say what? The ethical challenges of designing for humanlike interaction

Jonathan Foster (Microsoft)

Watch the keynote

Language shapes our thinking, our relationships, our sense of self. Conversation connects us in powerful, intimate, and often unconscious ways. Jonathan Foster explains why, as we design for natural language interactions and more humanlike digital experiences, language—as design material, conversation, and design canvas—reveals ethical challenges we couldn't encounter with GUI-powered experiences.

Scalable anomaly detection with Spark and SOS

Jeroen Janssens (Data Science Workshops)

Download slides (PDF)

Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.

Search logs + machine learning = autotagged inventory

John Berryman (Eventbrite)

View slides

Eventbrite is exploring a new machine learning approach that allows it to harvest data from customer search logs and automatically tag events based upon their content. John Berryman dives into the results and how they have allowed the company to provide users with a better inventory-browsing experience.

Sketching data and other magic tricks

Sophie Watson (Red Hat), William Benton (Red Hat)

Download slides (PDF)

Go hands-on with Sophie Watson and William Benton to examine data structures that let you answer interesting queries about massive datasets in fixed amounts of space and constant time. This seems like magic, but they'll explain the key trick that makes it possible and show you how to use these structures for real-world machine learning and data engineering applications.

Solve tomorrow’s business challenges with a modern data warehouse (sponsored by Matillion)

Daniel D'Orazio (Matillion)

Download slides (PDF)

According to Forrester, insight-driven companies are on pace to make $1.8 trillion annually by 2021. Daniel D'Orazio wants to know how fast your team can collect, process, and analyze data to solve present—and future—business challenges. You'll gain actionable tips and lessons learned from cloud data warehouse modernizations at companies like DocuSign that you can take back to your business.

Spark on Kubernetes for data science

Jordan Volz (Dataiku)

Download slides (PDF)

Spark on Kubernetes is a winning combination for data science that stitches together a flexible platform harnessing the best of both worlds. Jordan Volz gives a brief overview of Spark and Kubernetes, the Spark on Kubernetes project, why it’s an ideal fit for data scientists who may have been dissatisfied with other iterations of Spark in the past, and some applications.

Staying safe in the AI era

Cassie Kozyrkov (Google)

Watch the keynote

Machine learning and artificial intelligence are no longer science fiction, so now you have to address what it takes to harness their potential effectively, responsibly, and reliably. Based on lessons learned at Google, Cassie Kozyrkov offers actionable advice to help you find opportunities to take advantage of machine learning, navigate the AI era, and stay safe as you innovate.

The attribution problem

Tusharadri Mukherjee (Lenovo)

Download slides (PDF)

Attribution of media spend is a common problem shared by people in many different roles and industries. Many of the solutions that are simplest and easiest to implement don't drive the right behavior. Tushar Mukherjee shares practical lessons learned developing and applying multivariate attribution models at Lenovo.

The evolution of metadata: LinkedIn’s story

Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)

Download slides (PDF)

Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn’s metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars.

The future of Google Cloud data processing (sponsored by Google Cloud)

James Malone (Google)

Watch the keynote

Open source has always been a core pillar of Google Cloud’s data and analytics strategy. James Malone examines how, as the community continues to set industry standards, the company continues to integrate those standards into its services so organizations around the world can unlock the value of data faster.

The future of stablecoin

Catherine Gu (Stanford University)

Download slides (PDF)

With the emergence of the cryptoeconomy, there is a real demand for an alternative form of money. Major cryptocurrencies such as Bitcoin and Ethereum have thus far failed to achieve mass adoption. Catherine Gu examines the paradigm of algorithmic design of stablecoins, focusing on incentive structure and decentralized governance, to evaluate the role of stablecoin as a future medium of exchange.

ThirdEye: LinkedIn’s business-wide monitoring platform

Akshay Rai (LinkedIn)

Download slides (PDF)

Failures or issues in a product or service can negatively affect the business. Detecting issues in advance and recovering from them is crucial to keeping the business alive. Join Akshay Rai to learn more about LinkedIn's next-generation open source monitoring platform, an integrated solution for real-time alerting and collaborative analysis.

Trill: The crown jewel of Microsoft’s streaming pipeline explained

James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)

Download slides (PPTX)

Trill has been open-sourced, making the streaming engine behind services like the Bing Ads platform available for all to use and extend. James Terwilliger, Badrish Chandramouli, and Jonathan Goldstein dive into the history of and insights from streaming data at Microsoft. They demonstrate how its API can power complex application logic and the performance that gives the engine its name.

Unleash the power of data at scale (sponsored by Intel)

Jeremy Rader (Intel)

Watch the keynote

Data analytics is the long-standing but constantly evolving science that companies leverage for insight, innovation, and competitive advantage. Jeremy Rader explores Intel’s end-to-end data pipeline software strategy designed and optimized for a modern and flexible data-centric infrastructure that allows for the easy deployment of unified advanced analytics and AI solutions at scale.

Using Spark for crunching astronomical data on the LSST scale

Petar Zecevic (SV Group)

Download slides (PPTX)

The Large Synoptic Survey Telescope (LSST) is one of the most important future surveys. Its unique design allows it to cover large regions of the sky and obtain images of the faintest objects. After 10 years of operation, it will produce about 80 PB of data in images and catalog data. Petar Zecevic explains AXS, a system built for fast processing and cross-matching of survey catalog data.

What does the public say? A computational analysis of regulatory comments

Vlad Eidelman (FiscalNote)

Download slides (PDF)

While regulations affect your life every day, and millions of public comments are submitted to regulatory agencies in response to their proposals, analyzing the comments has traditionally been reserved for legal experts. Vlad Eidelman outlines how natural language processing (NLP) and machine learning can be used to automate the process by analyzing over 10 million publicly released comments.

Where's my lookup table? Modeling relational data in a denormalized world

Rick Houlihan (Amazon Web Services)

View slides

Data has always been and will always be relational. NoSQL databases are gaining in popularity, but that doesn't change the fact that the data is still relational; it just changes how we have to model the data. Rick Houlihan dives deep into how real entity relationship models can be efficiently modeled in a denormalized manner using schema examples from real application services.

Working with time series: Denoising and imputation frameworks to improve data density

Anjali Samani (CircleUp)

Download slides (PDF)

The application of smoothing and imputation strategies is common practice in predictive modeling and time series analysis. With a technique-agnostic approach, Anjali Samani provides qualitative and quantitative frameworks that address questions related to smoothing and imputation of missing values to improve data density.

Your easy move to serverless computing and radically simplified data processing

Gil Vernik (IBM)

Download slides (PDF)

Most analytic flows can benefit from serverless, starting with simple cases and moving to complex data preparations for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, Gil Vernik explores the “push to the cloud” experience, which dramatically simplifies serverless for big data processing frameworks.