Building and maintaining complex distributed systems
June 19–20, 2017: Training
June 20–22, 2017: Tutorials & Conference
San Jose, CA

Speaker slides

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Tammy Everts (SpeedCurve)
Tammy Everts walks you through a brief history of UX and web performance research, highlighting key studies that connect the dots between performance and user experience and sharing some educated guesses about new metrics that are just around the corner.
Bart De Vylder (CoScale)
Data science is a hot topic. Bart De Vylder offers a practical introduction that goes beyond the hype, exploring data analysis, visualization, and machine-learning techniques using Python for modeling the behavior of distributed systems. You'll leave with a solid starting point to implement data science techniques in your infrastructure or domain of interest.
Lisa van Gelder (Bauer Xcel Media)
Lisa van Gelder shares what she learned from an accidental A/B test. Last year, she interviewed for a new executive job at the same time as two (white, male) friends, and they compared notes. Lisa explains how "unqualified" is used to reject marginalized groups in tech and what we can do about it—both as individuals interviewing and as hiring managers looking to improve the interview process.
Alexander Grbic (Intel)
Alex Grbic explains how a single FPGA can deliver significant acceleration for multiple workloads. This new approach of integrating data analytics frameworks and existing databases enables enterprise customers to run unmodified applications without requiring any FPGA expertise and can be used with unstructured, NoSQL, and traditional relational databases, such as Swarm64.
Bryan Liles (Heptio)
In the past, applications were monolithic, and tracing flows for performance and bottlenecks was straightforward, as there was likely a single code base. In today's world, with multiple processes constituting a single application, tracing becomes more challenging. Bryan Liles offers a hands-on demonstration for implementing tracing in modern applications.
Rock Mutchler (Pythian)
Rock Mutchler shares best practices for designing and deploying resilient, fault-tolerant systems on AWS and offers deep dives into managed versus unmanaged services, monitoring and observability, high-availability design patterns, fault-tolerant and self-healing systems, disaster recovery and business continuity approaches, and DDoS mitigation.
Brad Stoner (AppDynamics)
As release velocity increases, teams are finding innovative ways to detect and resolve performance issues earlier in the development cycle. Brad Stoner explores how to implement an automated performance testing strategy and explains how leveraging APM (application performance management) tools can reduce time to market while increasing overall quality.
Ann Kilzer (Indeed)
Remember the old practice of the canary in the coal mine, where miners used fragile feathered friends as a failure detector for toxic gases? In software, a canary run is a trial executed on one machine before the rest of the cluster runs. Ann Kilzer explains how Indeed created a canary service leveraging Consul’s key-value store to improve the resilience of data reloads in any infrastructure.
Avantika Mathur (Electric Cloud)
Avan Mathur shares strategies for database deployments and rollbacks as well as some patterns and best practices for reliably deploying databases as part of your CD pipeline, safely rolling back database code, ensuring data integrity, and more.
Tammy Butow (Gremlin)
Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Tammy Butow leads a hands-on tutorial on chaos engineering, covering the tools and practices you need to implement chaos engineering in your organization.
Karl Isenberg (Mesosphere)
The orchestration space is fast moving and full of competing products, platforms, and frameworks. How do you choose the right one for your requirements? Karl Isenberg explores the features of several container orchestrators, breaking down the feature sets and characteristics into categories and scoring multiple solutions against each other, and discusses what's new this year.
Laine Campbell (Fastly), Charity Majors (Honeycomb)
SRE is becoming quite the ubiquitous term, but what about DBRE? Laine Campbell and Charity Majors dive into DBRE, exploring the paths to this craft and how to culturally evolve and support it. Laine and Charity focus on organizational scale, self-service, and force multipliers in recoverability, observability, availability, security, release management, and infrastructure.
David Hayes (PagerDuty)
Growing companies are customer-centric, and all members of an organization are now responsible for contributing to the customer experience. David Hayes explains why DevOps is a requirement for success and outlines some of the challenges that all DevOps teams will face over the next five years.
Roy Rapoport (Netflix)
When you're a scrappy startup, being nimble, agile, and flexible comes with the territory. But how do you maintain agility when you're a much, much larger company? Hope is not lost. Roy Rapoport shares critical leadership practices—focusing on encouraging failure, growing heretics, and empowering dissent—that will help you maintain a technical and organizational edge.
Artur Bergman (Fastly)
When Fastly CEO Artur Bergman helped organize the first Velocity event 10 years ago, the tech landscape was very different. Artur looks back at the last decade of DevOps and explores shifting patterns in operations, development, and systems through the lens of the Velocity Conference.
David Radcliffe (Shopify)
The flexibility and speed offered by cloud computing solutions have raised the bar for bare metal deployments. Automation is essential to speedy, reliable provisioning and capacity management. David Radcliffe explores the tools Shopify uses, such as Genesis, to automate its data center and empower developers to move quickly and keep up with the times.
Jack Chan (Shutterfly)
Jack Chan describes how Shutterfly migrated metadata from over 10B photos from a private data center into AWS in 100 days and explores designs to absorb mountains of metadata, on-premises ecommerce integration, and parallel user experiences, all in a highly scalable fashion. Shutterfly Photos is now a hybrid cloud solution with images hosted on-premises and client-facing photos metadata on AWS.
Henry Robinson (Cloudera)
It seems like everyone is building a distributed system. However, there's no common body of knowledge about how these systems should be built and scaled, beyond what is squirreled away in various academic papers. Henry Robinson shares lessons learned from over eight years spent building distributed systems and outlines a framework for thinking about distributed scaling challenges.
Corey Scobie (Akamai)
Using statistics about internet traffic patterns and growth from the past decade as a backdrop, Corey Scobie shares insights as to how and why edge computing clouds are so critical to the success of builders of scalable apps.
Patrick Hill (Atlassian)
Ever had an incident that didn't go as planned? Patrick Hill shares five values developed by Atlassian SREs to better handle incident management.
Arijit Mukherji (SignalFx)
Modern infrastructure and DevOps practices are evolving rapidly. These trends pose a new set of monitoring challenges. Arijit Mukherji shares real-world examples demonstrating what these challenges are, some approaches that worked, and metrics system capabilities that helped SignalFx deal with the challenge.
Dharma Shukla (Microsoft)
Dharma Shukla explores Azure Cosmos DB, discussing the internals of the system design and the various design trade-offs Azure had to make while building the service. Dharma also shares his experience and lessons learned operating a globally distributed database service worldwide while maintaining comprehensive service level agreements.
K Vignos (Twitter)
Constant change—caused by high attrition, frequent reorganization, shifting priorities, and management turnover, among other reasons—is the new normal. It takes months to onboard a new team member and get them adding value. Kathleen Vignos offers tips, shortcuts, and stories for stabilizing a team and finding a path to productivity amid the chaos.
Seth Vargo (Google)
It's great that you've moved to microservices, but how are you distributing secrets? Seth Vargo offers an overview of Vault's unique approach to secret management by providing secrets as a service for your services (and humans too), which is highly scalable and easily customizable to fit any environment.
Armon Dadgar (HashiCorp)
Armon Dadgar offers an overview of Nomad, an application scheduler designed for both long-running services and batch jobs. Along the way, Armon explores the benefits of using schedulers for empowering developers and increasing resource utilization and how schedulers enable new next-generation application architectures.
Sneha Inguva (DigitalOcean)
Over the past year, DigitalOcean's Delivery team has been building a runtime platform based on Kubernetes with the goal of making shipping code easier. A core component of this system is a monitoring and alerting system based on Prometheus and Alertmanager. Sneha Inguva offers an overview of the system and shares problems encountered, potential solutions, and key lessons learned in the process.
Peter Alvaro (UC Santa Cruz)
Lineage-driven fault injection (LDFI), a novel approach to automating failure testing, can greatly reduce the number of faults that must be explored via fault injection. Peter Alvaro explores LDFI’s theoretical roots in the database research notion of provenance and presents early results from the field and opportunities for near- and long-term future research.
Samir Jafferali (LinkedIn)
With members in every corner of the world, LinkedIn has built services around six CDNs, numerous PoPs, and three DNS platforms. Samir Jafferali explains how LinkedIn uses big data to steer DNS intelligently, optimizes the CDNs for performance, mitigates DDoSes, and measures metrics using RUM and synthetic monitoring and shares best practices that edge teams of all sizes can benefit from.
Dawn Parzych (Catchpoint)
Human perception and biases can influence how metrics are interpreted. While valid metrics can open lines of communication across and within teams, using vanity metrics or data to shame others can be counterproductive. Dawn Parzych explains how you can make a real and lasting impact on your organization by understanding the influence assumptions and biases have and how to present credible data.
Brendan Gregg (Netflix)
Advanced performance observability and debugging have arrived in Linux 4.x, with enhanced BPF (eBPF). Brendan Gregg offers an overview of Linux's new dynamic and static tracing tools for the analysis of filesystems, storage, CPUs, TCP, and more. Join in to explore a new generation of tools and visualizations.
Suman Karumuri (Pinterest)
Distributed tracing is an emerging field of monitoring distributed systems. Suman Karumuri shares the challenges of building and deploying distributed tracing at scale using PinTrace, one of the largest distributed tracing pipelines. Drawing on real-world examples, Suman explains how traces can be used to understand, debug, and optimize your production workflows.
Jasmin Nakic (Salesforce), Samir Pilipovic (Salesforce)
Jasmin Nakic and Samir Pilipovic examine the application of a linear regression predictive model on time series performance data, discussing and evaluating different models to find the optimal choice for a given dataset. All steps will be supported with Python-based scripts so that you can easily implement similar models on your own data.
Dave Andrews (Verizon Digital Media Services)
Cascading failures are every team's worst nightmare. Without the right monitoring, alerting, and containment in place, the failure of a system's key part can quickly result in the entire system failing. Dave Andrews shares strategies for addressing cascading failures at various scales: on a single system, within a given data center, and in a globally distributed environment.
Phillip Liu (SignalFx)
Phillip Liu explores the one thing that has become a driver of ever better engineering: constant removal of friction for engineers to not only build and ship code but also understand how code is used and how it works and operates. The end result is a culture that promotes many possible ways to address given challenges and surfaces novel approaches, which may have never arisen otherwise.
Today we depend upon service providers (for storage, compute, network, DNS, CDN, and much more) to build and deliver our applications. Even when the most sophisticated service providers on the internet fail—and they do—it’s still possible to build resilient applications. Kristopher Beevers explores how ops teams and developers are thinking about resiliency in a service provider world.
Pete Cheslock (Threat Stack)
Pete Cheslock shares the operational and security practices that helped Threat Stack scale while staying stable and secure, covering technology and tools and the various scale points that forced hard decisions.
Adam Shepard (AudienceScience)
Adam Shepard peels back the covers on a user delivery network—a worldwide distributed data store powering over 80 billion transactions a day at millisecond speed. Join in to learn about eventually consistent data architectures, tiered and hybrid storage layers, and what it takes to manage that much data at scale.
Ranjeeth Karthik Selvan Kathiresan (Salesforce), Gurpreet Multani (
Even though HBase is considered a highly scalable distributed solution, there are cases where the schema design of HBase tables or the way a client uses an HBase cluster may impact the scalability factor of HBase. Ranjeeth Karthik Selvan Kathiresan and Gurpreet Multani outline the most important things to consider when scaling your HBase cluster to accommodate high-volume and high-velocity data.
Sébastien Goasguen (TriggerMesh)
Kubernetes has emerged as one of the leading container orchestrators. Sébastien Goasguen explores its architecture and compares it with other orchestration/scheduling systems, outlining the similarities and explaining why Kubernetes API primitives make all the difference.
Dharmesh Kakadia (Microsoft)
Orchestration systems all have different design trade-offs. Despite best efforts, these choices are not always clear to developers using these systems. Dharmesh Kakadia describes the fundamentals of scheduling and explores the scheduling algorithms implemented by various orchestration systems, highlighting similarities, differences, and the consequences of the design choices for the users.
Timothy Gross (Joyent)
Conway's law tells us that "organizations which design systems. . .are constrained to produce designs which are copies of the communication structures of these organizations." What if we turn Conway's law around? Timothy Gross explores how to make technology choices that improve the culture of your organization.
Emil Stolarsky (Shopify), Justin Li (Shopify)
Once reserved for companies large enough to write a load balancer from scratch, load balancer middleware can be a powerful tool for scaling applications. Emil Stolarsky and Justin Li explain how Shopify uses scriptable load balancers to solve difficult infrastructure problems, such as sharding across data centers, handling flash sales, and responding quickly to DDoS attacks.
Juan Pablo Buriticá explains how to use technical RFCs as a decision-making tool in your engineering organization to increase effectiveness. When implemented properly, technical RFCs can encourage trust and delegation, respectful discussions, knowledge sharing, and accountability and support good software design.
Cliff Crocker (Akamai)
Most tools designed to help you manage your systems fall into two categories: finders, such as monitoring services and log file analyzers, or fixers, such as cloud infrastructure providers or container orchestration. It's up to you to translate information from your finders into actions for your fixers. Cliff Crocker explains how to use intelligent analytics to connect data to actions.
Christine Yen (Honeycomb)
Preaggregated metrics and time series form the backbone of many monitoring setups. They have many redeeming qualities but simply aren't sufficient for capturing or responding to the many ways things can go wrong in modern or complex systems. Christine Yen outlines the problems inherent in the use and implementation of preaggregated metrics.
Camille Fournier (Independent)
There is compelling evidence that technical workers want leaders who are strong technologists, leaders they believe they can learn from. What does this mean for those who wish to become engineering managers and technical leaders? How can you be an effective noncoding technical leader? Camille Fournier explores this conundrum and shares strategies to overcome it.
Dominic Williams (DFINITY)
DFINITY, a new kind of open cloud computing resource, takes the form of a decentralized network that conjures a performant "blockchain computer" with unbounded capacity that will act much like a gigantic shared mainframe for the world. Dominic Williams introduces the project and explores the foundational decentralized computing techniques it makes use of.
Dinesh Dutt (Cumulus Networks)
Dinesh Dutt explores network troubleshooting and explains how to avoid common network problems ranging from misconfigured cabling to misbehaving protocols, how a modern networking tool chest can help simplify network configurations, and how automation is improving troubleshooting turnaround times to minimize downtime.
Martin Woodward (Microsoft)
Martin Woodward tells the story of how Microsoft’s internal engineering systems are being transformed from a collection of disparate in-house tools built up over decades to One Engineering System with a globally distributed 24x7x365 service on the public cloud, utilizing modern techniques and industry-recognized open source technologies.