Building and maintaining complex distributed systems
June 19–20, 2017: Training
June 20–22, 2017: Tutorials & Conference
San Jose, CA
 
230 B
11:25am How to scale a distributed system Henry Robinson (Cloudera)
2:10pm Building distributed systems is accessible. I promise. Jamie Winsor (Chef Software)
3:40pm The verification of a distributed system Caitie McCaffrey (Twitter)
LL20 A/B
11:25am The problem with preaggregated metrics Christine Yen (Honeycomb)
1:15pm PinTrace: A distributed tracing pipeline Suman Karumuri (Pinterest)
2:10pm Observability in a dynamically scheduled world Sneha Inguva (DigitalOcean)
3:40pm Performance analysis superpowers with Linux eBPF Brendan Gregg (Netflix)
4:35pm Our many monitoring monsters Megan Anctil (Slack)
LL21 A/B
11:25am The road to chaos Nora Jones (Netflix)
3:40pm When it absolutely, positively has to be there: Reliability guarantees in Kafka Gwen Shapira (Confluent), Jeff Holoman (Cloudera)
4:35pm Precision chaos Aaron Blohowiak (Netflix)
LL21 C/D
11:25am Traffic shifts: Avoiding disasters at scale Michael Kehoe (LinkedIn), Anil Mallapur (LinkedIn)
3:40pm Git as a multisite service Patrick Reynolds (GitHub)
LL21 E/F
2:10pm Zero Trust networks: Building systems in untrusted networks Douglas Barth (Stripe), Evan Gilman (N/A)
4:35pm Incident Command: The far side of the edge Maarten Van Horenbeeck (Fastly), Lisa Phillips (Fastly)
LL20 C
LL20 D
11:25am Scaling HBase for big data (sponsored by Salesforce) Ranjeeth Karthik Selvan Kathiresan (Salesforce), Gurpreet Multani (Salesforce.com)
Grand Ballroom 220
8:55am Thursday opening welcome Mary Treseler (O'Reilly Media), Ines Sombra (Fastly), James Turnbull (Empatico)
9:55am Looking back to move forward Dianne Marsh (Netflix)
10:15am Cloud-native development: You're doing it wrong (sponsored by Oracle) Micha Hernandez van Leuffen (Wercker)
10:25am Performance is about people, not metrics Tammy Everts (SpeedCurve)
10:45am Closing remarks
10:50am Morning Break sponsored by CA Technologies | Room: Exhibit Hall
12:05pm Thursday lunch and Birds of a Feather sessions | Room: Exhibit Hall
2:50pm Afternoon Break sponsored by Verizon Digital Media Services | Room: Exhibit Hall
8:00am Morning Coffee sponsored by Atlassian | Room: Grand Ballroom 220 Foyer
8:15am Thursday Speed Networking | Room: Grand Ballroom 220 Foyer
5:15pm Closing Reception (sponsored by SpeedCurve) | Room: San Jose Ballroom, Marriott Hotel
11:25am-12:05pm (40m) Distributed Systems Distributed Data & Systems
How to scale a distributed system
Henry Robinson (Cloudera)
It seems like everyone is building a distributed system. However, there's no common body of knowledge about how these systems should be built and scaled, beyond what is squirreled away in various academic papers. Henry Robinson shares lessons learned from over eight years spent building distributed systems and outlines a framework for thinking about distributed scaling challenges.
1:15pm-1:55pm (40m) DevOps & Tools Cloud, Serverless computing
Lessons learned from operating a serverless-like platform at scale
Sangeeta Narayanan (Netflix)
Netflix operates a customizable API that allows the creation of optimized experiences on 1,000+ devices by providing developers a serverless-like platform and experience. Sangeeta Narayanan shares lessons learned operating and scaling the platform over the years and Netflix's approaches to some of the challenges it faced.
2:10pm-2:50pm (40m) Distributed Systems DevOps, Distributed Data & Systems
Building distributed systems is accessible. I promise.
Jamie Winsor (Chef Software)
Understanding and building distributed systems can be a daunting task, but like most other software development patterns, distributed systems mimic concepts in the real world that you're already familiar with. Jamie Winsor walks you through building a mental model to help you understand the basics of building distributed systems based on concrete, real-world systems.
3:40pm-4:20pm (40m) Distributed Systems Distributed Data & Systems
The verification of a distributed system
Caitie McCaffrey (Twitter)
Testing and verifying distributed systems is critically important. Caitie McCaffrey shares strategies for proving a distributed system is correct, including both formal methods and more practical forms of testing, such as fault injection and property-based testing, ensuring you are confident that your systems are doing the right thing.
4:35pm-5:15pm (40m) Distributed Systems Containerization, Security
No place like home: Building resilient distributed systems locally in Africa
Simon de Haan (Praekelt.org)
Developing reliable healthcare systems requires careful integration of a country’s health, tech, and legal ecosystems. In Africa, locally built resilient distributed systems are needed to meet the demand of national-scale digital health services and data sovereignty laws. Simon de Haan explores the challenges and proven solutions building in these environments.
11:25am-12:05pm (40m) Monitoring, Tracing, & Metrics
The problem with preaggregated metrics
Christine Yen (Honeycomb)
Preaggregated metrics and time series form the backbone of many monitoring setups. They have many redeeming qualities but simply aren't sufficient for capturing or responding to the many ways things can go wrong in modern or complex systems. Christine Yen outlines the problems inherent in the use and implementation of preaggregated metrics.
1:15pm-1:55pm (40m) Monitoring, Tracing, & Metrics Cloud, DevOps
PinTrace: A distributed tracing pipeline
Suman Karumuri (Pinterest)
Distributed tracing is an emerging field of monitoring distributed systems. Suman Karumuri shares the challenges of building and deploying distributed tracing at scale using PinTrace, one of the largest distributed tracing pipelines. Drawing on real-world examples, Suman explains how traces can be used to understand, debug, and optimize your production workflows.
2:10pm-2:50pm (40m) Monitoring, Tracing, & Metrics Containerization, Orchestration and Scheduling
Observability in a dynamically scheduled world
Sneha Inguva (DigitalOcean)
Over the past year, DigitalOcean's Delivery team has been building a runtime platform based on Kubernetes with the goal of making shipping code easier. A core component of this system is a monitoring and alerting system based on Prometheus and Alertmanager. Sneha Inguva offers an overview of the system and shares problems encountered, potential solutions, and key lessons learned in the process.
3:40pm-4:20pm (40m) Systems Engineering DevOps, Distributed Data & Systems
Performance analysis superpowers with Linux eBPF
Brendan Gregg (Netflix)
Advanced performance observability and debugging has arrived in Linux 4.x, with enhanced BPF (eBPF). Brendan Gregg offers an overview of Linux's new dynamic and static tracing tools for the analysis of filesystems, storage, CPUs, TCP, and more. Join in to explore a new generation of tools and visualizations.
4:35pm-5:15pm (40m) Monitoring, Tracing, & Metrics DevOps, Distributed Data & Systems
Our many monitoring monsters
Megan Anctil (Slack)
One size definitely doesn't fit all when it comes to open source monitoring solutions, and executing generally understood best practices in the context of unique distributed systems presents all sorts of problems. Megan Anctil shares pain points and lessons learned at Slack wrangling known technologies such as Icinga, Graphite, Grafana, and the Elastic Stack to best fit the company's use cases.
11:25am-12:05pm (40m) Resilience Engineering Cloud, Distributed Data & Systems
The road to chaos
Nora Jones (Netflix)
Chaos engineering isn't always the most popular practice among your developers. Nora Jones covers the specifics of designing a chaos engineering solution and explains how to increment your solution technically and culturally, the socialization and evangelism pieces that tend to get overlooked in the process, and how to get developers excited about purposefully injected failure.
1:15pm-1:55pm (40m) Resilience Engineering Cloud, Orchestration and Scheduling
The service mesh: Distributed resilience for a cloud-native world
Oliver Gould (Buoyant)
Modern application architecture is becoming cloud native: containerized, "microserviced," and orchestrated. But resilience is more than just Docker and Kubernetes. Oliver Gould explains why companies like PayPal, Ticketmaster, and Monzo are adopting the service mesh model, where internal, service-to-service traffic is managed and instrumented with a mesh of load-balancing proxies.
2:10pm-2:50pm (40m) Resilience Engineering Distributed Data & Systems
Canary in a coal mine: Building infrastructure resiliency with canary data reloads
Ann Kilzer (Indeed)
Remember the old practice of the canary in the coal mine, where miners used fragile feathered friends as a failure detector for toxic gasses? In software, a canary run is a trial executed on one machine before the rest of the cluster runs. Ann Kilzer explains how Indeed created a canary service leveraging Consul’s key value store to improve the resilience of data reloads in any infrastructure.
3:40pm-4:20pm (40m) Resilience Engineering
When it absolutely, positively has to be there: Reliability guarantees in Kafka
Gwen Shapira (Confluent), Jeff Holoman (Cloudera)
Kafka provides the low latency, high throughput, high availability, and scale that financial services firms require. But can it also provide complete reliability? Gwen Shapira and Jeff Holoman walk you through everything that happens to a message, from producer to consumer, and pinpoint all the places where data can be lost if you're not careful.
4:35pm-5:15pm (40m) Resilience Engineering DevOps, Distributed Data & Systems
Precision chaos
Aaron Blohowiak (Netflix)
Chaos Monkey and Kong changed the culture around infrastructure failure, but the most common cause of downtime is service failure. Turning off an entire service in production is too risky. Aaron Blohowiak offers an overview of precision chaos techniques that verify service-level fault tolerance and reveal hidden resource constraints while minimizing potential fallout.
11:25am-12:05pm (40m) Hardware, Storage, & Capacity Planning Automation, DevOps
Traffic shifts: Avoiding disasters at scale
Michael Kehoe (LinkedIn), Anil Mallapur (LinkedIn)
LinkedIn conducts regular traffic shifts during peak hours to ensure that it has sufficient capacity to handle extra load during disaster situations. Michael Kehoe and Anil Mallapur discuss how LinkedIn uses traffic shifts to mitigate user impact by migrating live traffic between its data centers and stress test site-wide services for improved capacity handling and member experience.
1:15pm-1:55pm (40m) Hardware, Storage, & Capacity Planning Automation
Genesis: Automating data center management with help from PXE and Chef
David Radcliffe (Shopify)
The flexibility and speed offered by cloud computing solutions have raised the bar for bare metal deployments. Automation is essential to speedy, reliable provisioning and capacity management. David Radcliffe explores the tools Shopify uses, such as Genesis, to automate its data center and empower developers to move quickly and keep up with the times.
2:10pm-2:50pm (40m) Hardware, Storage, & Capacity Planning Automation, Deployment
Distributed tracing and the future of chargeback and capacity planning
Daniel Spoonhower (LightStep)
As software grows more complex, doing chargebacks and capacity planning gets more challenging. Specifically, it becomes more difficult to attribute storage and other low-level requests to high-level services. Daniel Spoonhower shows how the distributed tracing concept of context propagation can be used to overcome this problem, without any maintenance costs.
3:40pm-4:20pm (40m) Hardware, Storage, & Capacity Planning Distributed Data & Systems
Git as a multisite service
Patrick Reynolds (GitHub)
GitHub uses Spokes, a custom application-level replication system, to provide redundancy and scalable capacity for the Git service. Originally, Spokes was limited to a single physical site. Patrick Reynolds offers an overview of Spokes and explains how GitHub extended it to span multiple sites, transparently providing read-anywhere, write-anywhere replication for all Git content.
4:35pm-5:15pm (40m) Hardware, Storage, & Capacity Planning Cloud, Distributed Data & Systems
How Shutterfly migrated 10+ billion photos to the cloud
Jack Chan (Shutterfly)
Jack Chan describes how Shutterfly migrated metadata from over 10B photos from a private data center into AWS in 100 days and explores designs to absorb mountains of metadata, on-premises ecommerce integration, and parallel user experiences, all in a highly scalable fashion. Shutterfly Photos is now a hybrid cloud solution with images hosted on-premises and client-facing photos metadata on AWS.
11:25am-12:05pm (40m) Security, Systems Engineering Security
Ground truth in cyberspace: How to launch effective defenses built out of AI
Allison Miller (Google)
Automation is critical for effective operations and security ops. In large-scale systems, manual intervention has to be the exception, not the expectation. But how can security be automated, given the complexity involved? Many platforms turn to ML or AI deployed in risk models. Allison Miller discusses data-driven decision tech and explains how ML and automation create better defenses.
1:15pm-1:55pm (40m) Security Security, Serverless computing
Serverless security: A pragmatic primer for builders and defenders
James Wickett (Signal Sciences)
Serverless is the design pattern for writing applications at scale without the necessity of managing infrastructure. It adds simplicity and a new economic model to cloud computing, but it creates some unique security challenges. James Wickett explores practical security approaches for serverless in four key areas: the software supply chain, the delivery pipeline, data flow, and attack detection.
2:10pm-2:50pm (40m) Security Automation, Security
Zero Trust networks: Building systems in untrusted networks
Douglas Barth (Stripe), Evan Gilman (N/A)
Douglas Barth and Evan Gilman offer an overview of Zero Trust, a new security model that considers all parts of the network to be equally untrusted. Doug and Evan show how to leverage a network's strengths by combining traditional SRE security approaches with novel technological arrangements while using software and hardware to secure the systems operating in those networks.
3:40pm-4:20pm (40m) Security Security
Scale it to a billion: How to build it, keep it safe, and keep it running
Pete Cheslock (Threat Stack)
Pete Cheslock shares the operational and security practices that helped Threat Stack scale while staying stable and secure, covering technology and tools and the various scale points that forced hard decisions.
4:35pm-5:15pm (40m) Security DevOps, Organizational optimization
Incident Command: The far side of the edge
Maarten Van Horenbeeck (Fastly), Lisa Phillips (Fastly)
Fastly operates the edge for many large web properties. To deal with emerging threats to its network, Fastly created a process that allows it to respond effectively to incidents: Incident Command, which rapidly coordinates teams during an incident. Maarten Van Horenbeeck and Lisa Phillips take you to the far side of the edge, demonstrating the protocols that work during an incident.
11:25am-12:05pm (40m) Sponsored
Open source tool chains for continuous testing (sponsored by CA Technologies)
Refael Botbol (CA Technologies)
The goal of continuous testing is to find defects earlier and release software faster, which can be achieved by integrating a set of open source functional and performance testing tools in the early stages of the software delivery lifecycle. Refael Botbol explains how to integrate open source tools like Apache JMeter and Selenium with Taurus and Jenkins as part of a continuous testing effort.
1:15pm-1:55pm (40m) Distributed Data & Databases Cloud, Distributed Data & Systems
Google Cloud Spanner: Global consistency at scale
Miles Ward (Google)
Google Cloud Spanner, Google's public launch of the internal Spanner service, makes available a new basic primitive for application design: globally consistent transactions. Want to know how it all works? Join Miles Ward for a detailed, demo-filled, nuanced look at the useful applications of Spanner for your workload.
2:10pm-2:50pm (40m) Sponsored
Rethink DNS for DevOps: Three ways DNS with intelligent response makes your applications better (sponsored by Oracle + Dyn)
Phil Stanhope (Oracle + Dyn)
For more than 30 years, the DNS has been one of the fundamental protocols of the internet, yet, despite its accepted importance, it has never quite gotten the due it deserves. Andy Smith explains why it's time to rethink DNS and realize the role it can play in building and running high-performance, distributed web applications.
11:25am-12:05pm (40m) Sponsored
Scaling HBase for big data (sponsored by Salesforce)
Ranjeeth Karthik Selvan Kathiresan (Salesforce), Gurpreet Multani (Salesforce.com)
Even though HBase is considered a highly scalable distributed solution, there are cases where the schema design of HBase tables or the way a client uses an HBase cluster may impact the scalability factor of HBase. Ranjeeth Karthik Selvan Kathiresan and Gurpreet Multani outline the most important things to consider when scaling your HBase cluster to accommodate high-volume and high-velocity data.
1:15pm-1:55pm (40m) Sponsored
Ensuring performance in complex architectures: Why integrated visibility is needed (sponsored by Riverbed)
Peco Karayanev (Riverbed Technology)
If you truly care about end-user experience and need to build highly scalable applications, you must stop treating your users, code, servers, and networks as independent systems. Peco Karayanev discusses a modern integrated visibility approach, where all monitoring shares a common data model that reveals issues previously hidden or misdiagnosed.
2:10pm-2:50pm (40m) Sponsored
Toward an intelligent and infinitely scalable decentralized cloud (sponsored by DFINITY)
Dominic Williams (DFINITY)
DFINITY, a new kind of open cloud computing resource, takes the form of a decentralized network that conjures a performant "blockchain computer" with unbounded capacity that will act much like a gigantic shared mainframe for the world. Dominic Williams introduces the project and explores the foundational decentralized computing techniques it makes use of.
8:55am-9:00am (5m)
Thursday opening welcome
Mary Treseler (O'Reilly Media), Ines Sombra (Fastly), James Turnbull (Empatico)
Program chairs Mary Treseler, James Turnbull, and Ines Sombra open the second day of keynotes.
9:00am-9:20am (20m)
Building cloud-native applications with Kubernetes and Istio
Kelsey Hightower (Google)
Keynote by Kelsey Hightower
9:20am-9:25am (5m) Sponsored Keynote
Preventing cascading failures in a global network (sponsored by Verizon Digital Media Services)
Dave Andrews (Verizon Digital Media Services)
Cascading failures are every team's worst nightmare. Without the right monitoring, alerting, and containment in place, the failure of a system's key part can quickly result in the entire system failing. Dave Andrews shares strategies for addressing cascading failures at various scales: on a single system, within a given data center, and in a globally distributed environment.
9:25am-9:30am (5m) Sponsored Keynote
Removing engineering friction: Creating an evolutionary culture (sponsored by SignalFx)
Phillip Liu (SignalFx)
Phillip Liu explores the one thing that has become a driver of ever better engineering: constant removal of friction for engineers to not only build and ship code but also understand how code is used and how it works and operates. The end result is a culture that promotes many possible ways to address given challenges and surfaces novel approaches, which may have never arisen otherwise.
9:30am-9:50am (20m)
Lessons learned from building a globally distributed database service from the ground up
Dharma Shukla (Microsoft)
Dharma Shukla explores Azure Cosmos DB, discussing the internals of the system design and the various design trade-offs Azure had to make while building the service. Dharma also shares his experience and lessons learned operating a globally distributed database service worldwide while maintaining comprehensive service level agreements.
9:50am-9:55am (5m) Sponsored Keynote
The false dichotomy of finders versus fixers (sponsored by SOASTA, now a part of Akamai)
Cliff Crocker (Akamai)
Most tools designed to help you manage your systems fall into two categories: finders, such as monitoring services and log file analyzers, or fixers, such as cloud infrastructure providers or container orchestration. It's up to you to translate information from your finders into actions for your fixers. Cliff Crocker explains how to use intelligent analytics to connect data to actions.
9:55am-10:15am (20m)
Looking back to move forward
Dianne Marsh (Netflix)
Our industry moves fast. What does that mean for our teams, for our careers, for companies that we join or create? Beyond committing to a career of continuous learning, what does remaining relevant in tech look like? Dianne Marsh looks back at 30 years as a software professional, addressing how things have changed and how they've stayed the same.
10:15am-10:25am (10m) Sponsored Keynote
Cloud-native development: You're doing it wrong (sponsored by Oracle)
Micha Hernandez van Leuffen (Wercker)
Developing and running applications in a cloud-native world has a problem set of its own. While this paradigm solves a lot of old challenges, moving to containers and microservices and launching them at scale on schedulers such as Kubernetes requires a different approach. Micha Hernandez van Leuffen shares five best practices for developing cloud-native applications.
10:25am-10:45am (20m)
Performance is about people, not metrics
Tammy Everts (SpeedCurve)
Tammy Everts walks you through a brief history of UX and web performance research, highlighting key studies that connect the dots between performance and user experience and sharing some educated guesses about new metrics that are just around the corner.
10:45am-10:50am (5m)
Closing remarks
Program chairs Mary Treseler, James Turnbull, and Ines Sombra close the second day of keynotes.
10:50am-11:25am (35m)
Break: Morning Break sponsored by CA Technologies
12:05pm-1:15pm (1h 10m)
Thursday lunch and Birds of a Feather sessions
Birds of a Feather (BoF) sessions provide face-to-face exposure to those interested in the same projects and concepts. BoFs can be organized for individual projects or broader topics (best practices, open data, standards, etc.). BoFs are entirely up to you. We post your topic and provide the space and time. You provide the engaging topic.
2:50pm-3:40pm (50m)
Break: Afternoon Break sponsored by Verizon Digital Media Services
8:00am-8:15am (15m)
Break: Morning Coffee sponsored by Atlassian
8:15am-8:45am (30m)
Thursday Speed Networking
Meet us before the opening keynotes on Thursday morning and get to know fellow attendees in quick, 60-second discussions.
5:15pm-6:45pm (1h 30m)
Closing Reception (sponsored by SpeedCurve)
Join us for the closing celebration of Velocity and Fluent. Don’t miss this last chance to mingle.