AI has the potential to add $16 trillion to the global economy by 2030, but adoption has been slow. While we understand the power of AI, many of us aren’t sure how to fully unleash its potential. Join Robert Thomas and Tim O'Reilly to learn why AI isn't magic: it's hard work.
Graphs are a powerful way to represent knowledge. Organizations, in fields such as biosciences and finance, are starting to amass large knowledge graphs, but they lack the machine learning tools to extract insights from them. David Mack offers an overview of what insights are possible and surveys the most popular approaches.
Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features such as erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, and data node disk balancing. They also walk you through upgrade guidance for moving from 2.x to 3.x.
Jungwook Seo walks you through AccuInsight+, a cloud data analytics platform with eight data analytic services on CloudZ (one of the biggest cloud service providers in Korea), which SK Holdings announced in January 2019.
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
Viridiana Lourdes explains how banks and financial enterprises can adopt and integrate actual risk models with existing systems to enhance the performance and operational efficiency of the financial crimes organization. Join in to learn how actual risk models can reduce segmentation noise, utilize unlabeled transactional data, and spot unusual behavior more effectively.
Mumin Ransom gives an overview of the data management and privacy challenges around automating ML model (re)deployments and stream-based inferencing at scale.
Each week 275 million people shop at Walmart, generating interaction and transaction data. Navinder Pal Singh Brar explains how the customer backbone team enables extraction, transformation, and storage of customer data to be served to other teams. At 5 billion events per day, the Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer.
Moty Fania details Intel IT's experience implementing a sales AI platform. The platform is based on a streaming microservices architecture with a message bus backbone. Designed for real-time data extraction and reasoning, it handles the processing of millions of website pages and is capable of sifting through millions of tweets per day.
Siva Sivakumar explains the Cisco Data Intelligence Platform (CDIP), which is a cloud-scale architecture that brings together big data, AI and compute farm, and storage tiers to work together as a single entity, while also being able to scale independently to address the IT issues in the modern data center.
According to Gartner, over 80% of data lake projects were deemed inefficient. Data lakes come and go. Swamps happen. Data agility is fleeting. Chuck Yarbrough walks you through how data ops practices and a modern data architecture bring greater visibility and allow faster data access with proper governance.
Using SAS, Python, and AWS SageMaker, Major League Baseball's (MLB's) data science team outlines how it predicts ticket purchasers’ likelihood to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.
Anti-patterns are behaviors that turn bad problems into even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past four years, data security and privacy anti-patterns have emerged across hundreds of customers and industry verticals—and the trend is obvious. Steven Touw details five anti-patterns and, more importantly, their solutions.
Based on a critical evaluation of the iconic yield curve chart, Alan Smith argues that combining visualization (data to pixels) with sonification (data to pitch) offers potential to improve not only aesthetic multimedia experiences but also an opportunity to take the presentation of data into the rapidly expanding universe of screenless devices and products.
You'll go hands-on to learn the theoretical foundations and principal ideas underlying deep learning and neural networks. Bruno Gonçalves structures the code of the implementations to closely resemble the way Keras is organized, so that by the end of the course, you'll be prepared to dive deeper into the deep learning applications of your choice.
Garrett Hoffman walks you through deep learning methods for natural language processing and natural language understanding tasks, using a live example in Python and TensorFlow with StockTwits data. Methods include Word2Vec, recurrent neural networks (RNNs) and variants (long short-term memory [LSTM] and gated recurrent unit [GRU]), and convolutional neural networks.
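One idea named above, Word2Vec, trains on (center, context) word pairs drawn from a sliding window over text. As a minimal, hypothetical sketch (not the presenter's code), the pair-generation step might look like this:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for a skip-gram model.

    For each position, every token within `window` words on either
    side becomes a context word for the center token.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Example: pairs for a short StockTwits-style message
print(skipgram_pairs(["bullish", "on", "tech"], window=1))
# → [('bullish', 'on'), ('on', 'bullish'), ('on', 'tech'), ('tech', 'on')]
```

These pairs are then fed to a shallow network that learns dense word vectors; the RNN variants in the talk consume the resulting embeddings as input sequences.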
Sajan Govindan outlines CERN’s research on deep learning in high energy physics experiments as an alternative to customized rule-based methods with an example of topology classification to improve real-time event selection at the Large Hadron Collider. CERN uses deep learning pipelines on Apache Spark using BigDL and Analytics Zoo open source software on Intel Xeon-based clusters.
In this keynote, we’ll introduce you to the new 100% open source Cloudera Data Platform (CDP), the world’s first enterprise data cloud. CDP is hybrid and multi-cloud, delivering the speed, agility, and scale you need to secure and govern your data anywhere from the edge to AI.
Tokyo Century was ready for a change. Credit risk decisions were taking too long and the home office was taking notice. The company needed a full stack data solution to increase the speed of loan authorizations, and it needed it quickly. Moto Tohda explains how Tokyo Century put data at the center of its credit risk decision making and removed institutional knowledge from the process.
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, so clusters tend to stay overprovisioned and the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design for efficient downscaling, which helps achieve better resource utilization and lower TCO.
Sara Menker, CEO, Gro Intelligence
California is following the EU's GDPR with the California Consumer Privacy Act (CCPA) in 2020. Noncompliance carries significant penalties, but many companies aren't prepared for this strict regulation. This session explores the capabilities your data environment needs to simplify compliance with the CCPA, GDPR, and other regulations.
Andrew Brust provides a primer on data catalogs and a review of the major vendors and platforms in the market. He examines the use of data catalogs with classic and newer data repositories, including data warehouses, data lakes, cloud object storage, and even software and applications. You'll learn about AI's role in the data catalog world and get an analysis of data catalog futures.
We're living in a culture obsessed with predictions. In politics and business, we collect data in service of the obsession. But our need for certainty and control leads some organizations to be duped by unproven technology or pseudoscience—often with unforeseen societal consequences. Farrah Bostic looks at historical—and sometimes funny—examples of sacrificing understanding for "data."
Dean Wampler dives into how (and why) to integrate ML into production streaming data pipelines and to serve results quickly; how to bridge data science and production environments with different tools, techniques, and requirements; how to build reliable and scalable long-running services; and how to update ML models without downtime.
With ML becoming more mainstream, the side effects of machine learning and AI on our lives are becoming more visible. You have to take extra measures to make machine learning models fair and unbiased, and awareness of the need to preserve privacy in ML models is growing rapidly. Mikio Braun explores techniques and concepts around fairness, privacy, and security when it comes to machine learning models.
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.
It’s no surprise that Revibe needed to learn how to evolve to satisfy today's data-hungry market. It launched its first hardware-only device in 2015 and quickly learned that to stay alive, the company needed to get its hands into data. Gwen Campbell discusses Revibe's metamorphosis from a hardware company to a data company and shares lessons learned along the way.
Data-driven software is revolutionizing the world, enabling intelligent services we interact with daily. Robert Pesch and Robin Senge outline the development process, statistical modeling, data-driven decision making, and components needed to productionize a fully automated and highly scalable demand forecasting system for an online grocery shop of a billion-dollar retail group in Europe.
Customer satisfaction is a key success factor for any business. Moise Convolbo highlights the process to capture relevant customer behavioral data, cluster the user journey by different patterns, and draw conclusions for data-informed business decisions.
Time series forecasts depend on sensors or measurements made in the real, messy world. The sensors flake out, get turned off, disconnect, and otherwise conspire to cause missing signals: signals that may tell you what tomorrow's temperature will be or what your blood glucose levels are before bed. Alfred Whitehead and Clare Jeon explore methods for handling data gaps and when to consider each.
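One of the simplest gap-handling methods the talk's topic suggests is linear interpolation across interior gaps. As an illustrative sketch (the function name and behavior at the edges are assumptions, not the presenters' method):

```python
def fill_gaps(series):
    """Linearly interpolate interior runs of None in a numeric series.

    Leading and trailing gaps are left as None, since there is no
    anchor on one side to interpolate from.
    """
    out = list(series)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            start = i
            while i < n and out[i] is None:
                i += 1
            if start > 0 and i < n:  # interior gap: interpolate
                left, right = out[start - 1], out[i]
                gap = i - start + 1
                for k in range(start, i):
                    out[k] = left + (right - left) * (k - start + 1) / gap
        else:
            i += 1
    return out

print(fill_gaps([1.0, None, None, 4.0]))  # → [1.0, 2.0, 3.0, 4.0]
```

Interpolation works well for short gaps in smooth signals; for long gaps or seasonal data, model-based imputation is usually a better fit.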
Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more.
The financial services industry is increasingly using disruptive technology—including AI and machine learning, edge computing, blockchain, mobile and mixed reality, virtual assistants, and quantum computing to name a few—to enhance the customer experience and personalize their interactions with customers. Swatee Singh outlines how the same is true at American Express.
Machine learning and constraint-based optimization are both used to solve critical business problems. They come from distinct research communities and have traditionally been treated separately. But Jari Koister examines how they're similar, how they're different, and how they can be used to solve complex problems with amazing results.
Anant Chintamaneni and Matt Maccaux explore whether the combination of containers with large-scale distributed data analytics and machine learning applications is like combining oil and water—or like peanut butter and chocolate.
While data science value is well recognized within tech, experience across industries shows that the ability to realize and measure business impact is not universal. A core issue is that data science programs face unique risks many leaders aren’t trained to hedge against. Brian Dalessandro addresses these risks and advocates for new ways to think about and manage data science programs.
Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumption of collocated compute and storage doesn't always hold in today’s data centers. Chenzhao Guo and Carson Wang outline the implementation of a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends, making life easier for customers.
Imagine watching sports and being able to immediately find all plays that are similar to what just happened. Better still, imagine being able to draw a play with the Xs and Os on an interface like a coach draws on a chalkboard, instantaneously finding all the similar plays, and conducting analytics on those plays. Join Patrick Lucey to see how this is possible.
Supervised machine learning requires large labeled datasets—a prohibitive limitation in many real-world applications. But this could be avoided if machines could learn from a few labeled examples. Shioulin Sam explores and demonstrates an algorithmic solution that relies on collaboration between human and machine to label smartly, and she outlines product possibilities.
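A common building block for human-machine labeling collaboration is uncertainty sampling: route to the human labeler the example the model is least sure about. This is a hypothetical sketch of that selection step, not the presenter's algorithm:

```python
def most_uncertain(probabilities):
    """Return the index of the unlabeled example whose predicted
    positive-class probability is closest to 0.5 (least confident),
    i.e., the best candidate to send to a human labeler.
    """
    return min(range(len(probabilities)),
               key=lambda i: abs(probabilities[i] - 0.5))

# The model is nearly certain about examples 0 and 2,
# so example 1 is routed to the human for a label.
print(most_uncertain([0.95, 0.52, 0.03]))  # → 1
```

Iterating this loop—train, score the unlabeled pool, label the most uncertain item, retrain—lets a model reach useful accuracy with far fewer labels than labeling the pool at random.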
Jennifer Yang discusses a use case that demonstrates how to use machine learning techniques in the data quality management space in the financial industry. You'll discover the results of applying various machine learning techniques in the four most commonly defined data validation categories and learn approaches to operationalize the machine learning data quality management solution.
Kafka is often a piece of the production stack that no one wants to touch—because it just works. Alon Gavra outlines how Kafka sits at the core of AppsFlyer's infrastructure, which processes billions of events daily.
At MakeMyTrip, customers used voice or email to contact agents for postsale support. To improve agent efficiency and the customer experience, MakeMyTrip developed a chatbot, Myra, using some of the latest advances in deep learning. Madhu Gopinathan and Sanjay Mohan explain Myra's high-level architecture and the business impact it created.
David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.
Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you'll look at Kafka as the storage layer, Kafka Connect for data integration, and Kafka Streams and KSQL as the compute layer.
Stavros Kontopoulos and Debasish Ghosh explore online machine learning algorithm choices for streaming applications, especially those with resource-constrained use cases like IoT and personalization. They dive into Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms from implementation to production deployment, describing the pros and cons of each of them.
Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. Tomer Levi walks you through how the data engineering team at Fundbox uses AWS serverless technologies to address this problem and how it enables data scientists, BI developers, and engineers to move faster.
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases.
Feature engineering is generally the section that gets left out of machine learning books, but it's also the most critical part in practice. Ted Dunning explores techniques, a few well known, but some rarely spoken of outside the institutional knowledge of top teams, including how to handle categorical inputs, natural language, transactions, and more in the context of machine learning.
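One of the better-known techniques for handling categorical inputs is the hashing trick, which maps tokens into a fixed-size vector without maintaining a vocabulary. This sketch is illustrative only (the function and its parameters are assumptions, not the presenter's material):

```python
import hashlib

def hash_features(tokens, dim=16):
    """Map categorical/text tokens into a fixed-size vector via the
    hashing trick, avoiding an explicit vocabulary dictionary.
    A signed variant (sign drawn from a hash byte) reduces the bias
    introduced by hash collisions.
    """
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[idx] += sign
    return vec

v = hash_features(["color=red", "country=US", "color=red"])
print(len(v))  # → 16 (fixed dimension regardless of vocabulary size)
```

Because the output dimension is fixed, the same code handles unseen categories at inference time with no dictionary updates, at the cost of occasional collisions.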
Hospitals small and large are adopting cloud technologies, and many are in hybrid environments. These distributed environments pose challenges, none of which are more critical than the protection of protected health information (PHI). Jeff Zemerick explores how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment.
Ben Lorica dives into emerging technologies for building data infrastructures and machine learning platforms.
Developing, deploying and managing AI and anomaly detection models is tough business. See-Kit Lam details how Malwarebytes has leveraged containerization, scheduling, and orchestration to build a behavioral detection platform and a pipeline to bring models from concept to production.
Language shapes our thinking, our relationships, our sense of self. Conversation connects us in powerful, intimate, and often unconscious ways. Jonathan Foster explains why, as we design for natural language interactions and more humanlike digital experiences, language—as design material, conversation, and design canvas—reveals ethical challenges we never encountered with GUI-powered experiences.
Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.
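The intuition behind SOS is affinity-based: each point distributes "binding probability" to its neighbors, and a point that no other point binds to is likely an outlier. The sketch below is a deliberately simplified variant of that idea (real SOS selects a per-point bandwidth via a perplexity parameter; a single global bandwidth is used here for brevity):

```python
import math

def outlier_probabilities(points, bandwidth=1.0):
    """Illustrative affinity-based outlier scoring in the spirit of SOS.

    Each point i assigns binding probability b[i][j] to every other
    point j, proportional to a Gaussian affinity exp(-d_ij^2 / h^2).
    A point rarely bound to by its neighbors scores as an outlier:
    P(outlier_j) = prod over i of (1 - b[i][j]).
    """
    n = len(points)
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        aff = [math.exp(-(points[i] - points[j]) ** 2 / bandwidth ** 2)
               if j != i else 0.0 for j in range(n)]
        total = sum(aff) or 1.0
        for j in range(n):
            b[i][j] = aff[j] / total
    probs = []
    for j in range(n):
        p = 1.0
        for i in range(n):
            if i != j:
                p *= 1.0 - b[i][j]
        probs.append(p)
    return probs

# Three clustered points and one isolated point: the point at 10.0
# receives almost no binding probability, so it scores near 1.0.
scores = outlier_probabilities([0.0, 0.1, 0.2, 10.0])
```

The perplexity-based bandwidth selection in the real algorithm makes the scores robust to varying local density, which this simplified version does not capture.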
Eventbrite is exploring a new machine learning approach that allows it to harvest data from customer search logs and automatically tag events based upon their content. John Berryman dives into the results and how they have allowed the company to provide users with a better inventory-browsing experience.
Go hands-on with Sophie Watson and William Benton to examine data structures that let you answer interesting queries about massive datasets in fixed amounts of space and constant time. This seems like magic, but they'll explain the key trick that makes it possible and show you how to use these structures for real-world machine learning and data engineering applications.
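The "fixed space, constant time" trick referred to above is the family of probabilistic data structures such as Bloom filters and count-min sketches. A minimal Bloom filter sketch (an illustration of the general technique, not the presenters' code; one byte per bit position is used for simplicity):

```python
import hashlib

class BloomFilter:
    """Fixed-size, constant-time approximate set membership.

    `add` and `__contains__` each touch exactly `num_hashes`
    positions, so query time and memory are independent of how many
    items have been inserted—at the cost of occasional false
    positives. False negatives are impossible.
    """
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for user in ["alice", "bob", "carol"]:
    bf.add(user)
print("alice" in bf)  # → True (inserted items are always found)
```

The "key trick" is accepting a small, tunable false-positive rate in exchange for memory that never grows with the dataset.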
According to Forrester, insight-driven companies are on pace to make $1.8 trillion annually by 2021. Daniel D'Orazio wants to know how fast your team can collect, process, and analyze data to solve present—and future—business challenges. You'll gain actionable tips and lessons learned from cloud data warehouse modernizations at companies like DocuSign that you can take back to your business.
Spark on Kubernetes is a winning combination for data science that stitches together a flexible platform harnessing the best of both worlds. Jordan Volz gives a brief overview of Spark and Kubernetes, the Spark on Kubernetes project, why it’s an ideal fit for data scientists who may have been dissatisfied with other iterations of Spark in the past, and some applications.
Machine learning and artificial intelligence are no longer science fiction, so now you have to address what it takes to harness their potential effectively, responsibly, and reliably. Based on lessons learned at Google, Cassie Kozyrkov offers actionable advice to help you find opportunities to take advantage of machine learning, navigate the AI era, and stay safe as you innovate.
Attribution of media spend is a common problem shared by people in many different roles and industries. Many of the solutions that are simplest and easiest to implement don't drive the right behavior. Tushar Mukherjee shares practical lessons learned developing and applying multivariate attribution models at Lenovo.
Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn’s metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars.
Open source has always been a core pillar of Google Cloud’s data and analytics strategy. James Malone examines how, as the community continues to set industry standards, the company continues to integrate those standards into its services so organizations around the world can unlock the value of data faster.
With the emergence of the cryptoeconomy, there is real demand for an alternative form of money. Major cryptocurrencies such as Bitcoin and Ethereum have thus far failed to achieve mass adoption. Catherine Gu examines the paradigm of algorithmic design of stablecoins, focusing on incentive structure and decentralized governance, to evaluate the role of stablecoins as a future medium of exchange.
Failures or issues in a product or service can negatively affect the business. Detecting issues in advance and recovering from them is crucial to keeping the business alive. Join Akshay Rai to learn more about LinkedIn's next-generation open source monitoring platform, an integrated solution for real-time alerting and collaborative analysis.
Trill has been open-sourced, making the streaming engine behind services like the Bing Ads platform available for all to use and extend. James Terwilliger, Badrish Chandramouli, and Jonathan Goldstein dive into the history of and insights from streaming data at Microsoft. They demonstrate how its API can power complex application logic and the performance that gives the engine its name.
Data analytics is the long-standing but constantly evolving science that companies leverage for insight, innovation, and competitive advantage. Jeremy Rader explores Intel’s end-to-end data pipeline software strategy designed and optimized for a modern and flexible data-centric infrastructure that allows for the easy deployment of unified advanced analytics and AI solutions at scale.
The Large Synoptic Survey Telescope (LSST) is one of the most important upcoming surveys. Its unique design allows it to cover large regions of the sky and obtain images of the faintest objects. After 10 years of operation, it will produce about 80 PB of data in images and catalog data. Petar Zecevic explains AXS, a system built for fast processing and cross-matching of survey catalog data.
While regulations affect your life every day, and millions of public comments are submitted to regulatory agencies in response to their proposals, analyzing the comments has traditionally been reserved for legal experts. Vlad Eidelman outlines how natural language processing (NLP) and machine learning can be used to automate the process by analyzing over 10 million publicly released comments.
Data has always been and will always be relational. NoSQL databases are gaining in popularity, but that doesn't change the fact that the data is still relational; it just changes how we have to model it. Rick Houlihan dives deep into how real entity relationship models can be efficiently modeled in a denormalized manner, using schema examples from real application services.
The application of smoothing and imputation strategies is common practice in predictive modeling and time series analysis. With a technique-agnostic approach, Anjali Samani provides qualitative and quantitative frameworks that address questions related to smoothing and imputation of missing values to improve data density.
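A centered moving average is one of the most common smoothing strategies the abstract alludes to. As a technique-agnostic, hypothetical sketch (the edge handling here is one of several reasonable choices):

```python
def moving_average(series, window=3):
    """Centered moving average; windows shrink at the edges.

    A simple smoothing pass that reduces noise before modeling,
    at the cost of blurring sharp changes in the signal.
    """
    half = window // 2
    out = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        chunk = series[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out

# The spike at index 2 is damped and spread across its neighbors.
print(moving_average([1.0, 2.0, 9.0, 2.0, 1.0]))
```

Choosing the window size is the qualitative judgment the frameworks in the talk address: too small and noise remains, too large and real structure is smoothed away.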
Most analytic flows can benefit from serverless, starting with simple cases and moving to complex data preparations for AI frameworks like TensorFlow. To address the challenge of integrating serverless without major disruptions to your system, Gil Vernik explores the “push to the cloud” experience, which dramatically simplifies serverless for big data processing frameworks.