This session will cover the value that linking algorithms bring to identity risk management, and how to apply linking algorithms, data, and supercomputing capability to the challenges of identity risk management and identity fraud. We will also look at patterns of identity fraud, namely (stolen) identities that have come back from the dead, and how to distinguish those from real, live identities.
Apache Samza is a framework for processing high-volume real-time event streams. In this session we will walk through our experiences of putting Samza into production at LinkedIn, discuss how it compares to other stream processing tools, and share the lessons we learnt about dealing with real-time data at scale.
A talk about how the largest professional social network in the world is digitally mapping the global economy to connect talent with opportunity at massive scale.
By reducing friction from deploying models and comparing competing models, data scientists can focus on high-value efforts. At Vast we've experimented with tools and strategies for this while shipping a suite of data products for consumers and agents in the midst of some of life’s biggest purchases. I'll share best practices and lessons learned, and help you free up time for the fun stuff.
LinkedIn processes an enormous number of events each day. In this talk, you will learn the background of the data challenges that LinkedIn faced, how the teams came together to construct the solution, and the underlying stack powering this solution, including an interactive analytics infrastructure and a self-serve data visualization frontend that remain fast at scale.
A lot of stationary, big data begins its life as small data in rapid motion - think logs, sensors, social data. The pressure is on architects, infra devops, and app developers to harness real-time data, and expose it to the right data processing paradigm. Learn how on AWS, services like Amazon Kinesis, Redshift, and Elastic MapReduce can be composed to deliver a smarter big data infrastructure.
Leveraging our experience from working on some of the largest-scale high-growth applications at Facebook and other companies, including building the most popular data analysis tool, Scuba, this talk outlines 10 lessons learned, along with best practices for extracting the most value out of data while avoiding common pitfalls.
Microsoft Translator currently supports 100+ languages. We constantly improve the translation quality, add new scenarios, all with a constant team size. This session describes a production scale ML architecture using MS Translator as a case study. You will learn the mental model to approach your ML problem and concrete Do’s and Don’ts for the various components of the ML system architecture.
Open Source Real Time BI using Storm, Hadoop, Titan, Druid & D3
Organizations often showcase the virtues of their data platforms, but rarely share the challenges and decisions faced along the way. Our session describes how we architected our analytics stack around Druid, an open source distributed data store, and how we overcame the challenges around scaling the system, balancing features with cost, and making performance consistent.
An increasingly common task for data science is the measurement and attribution of experimental impact. Using examples from healthcare.gov, Microsoft advertising, and Bing experimentation, we will explore the strengths, weaknesses, and pitfalls of techniques for dealing with impact and attribution in scenarios and data in which controlled experiments were not possible or otherwise not performed.
Netflix continues to evolve its big data architecture in the cloud with performance enhancements and updated OSS offerings. We will share our experiences and selections in file formats, interactive query engines, and instance types. Genie emerges with updates to support YARN applications, and we will unveil a new performance visualization tool, Inviso.
In the US alone, we make over 40 billion queries every month. From the time we wake up, using search engines is one of the top things we do online. This talk will show some examples of how this data can be used, from funny things like determining which city wakes up earliest to more complex scenarios like finding adverse drug interactions.