Build & maintain complex distributed systems
October 1–2, 2017: Training
October 2–4, 2017: Tutorials & Conference
New York, NY

Speaker slides

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Bart De Vylder (CoScale), Pieter Buteneers (CoScale)
Data science is a hot topic. Bart De Vylder and Pieter Buteneers offer a practical introduction that goes beyond the hype, exploring data analysis, visualization, and machine learning techniques using Python for modeling the behavior of distributed systems. You'll leave with a solid starting point to implement data science techniques in your infrastructure or domain of interest.
Neha Narula (Digital Currency Initiative)
Bitcoin showed us a new way of moving value around the internet without intermediaries. Neha Narula explains how this paradigm might apply to our traditional ways of thinking about databases that cross organizational boundaries. As data on the web becomes consolidated around a few key players, the blockchain might help users gain more control.
Tom Adams (ThoughtWorks)
Containerization has launched a new wave of software deployment models, but do our philosophies for building, testing, and deploying software still hold true? Tom Adams walks you through creating a build pipeline for Docker images that is rooted in continuous integration (CI) practices.
Mark McBride (Turbine Labs)
With the recent flourishing of observability systems, there's no shortage of things to monitor. Sadly, humans have limited capacity to process them all. Mark McBride outlines three key metrics—request rate, success rate, and the latency histogram—that provide a high-level abstraction of the customer experience. If these three metrics are good, your system is healthy from a customer perspective.
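As a rough illustration of the three metrics named above, the following minimal sketch derives request rate, success rate, and a latency histogram from a small in-memory request log. The log format, the bucket boundaries, and the `summarize` helper are assumptions made for this example and are not taken from the session.

```python
from collections import Counter

# Hypothetical request log: (timestamp_seconds, succeeded, latency_ms)
REQUESTS = [
    (0.0, True, 12.0),
    (0.4, True, 48.0),
    (1.1, False, 510.0),
    (1.5, True, 95.0),
    (2.2, True, 20.0),
    (2.9, True, 230.0),
]

# Hypothetical histogram bucket upper bounds, in milliseconds
BUCKETS_MS = [25, 50, 100, 250, 500]


def bucket_label(latency_ms, buckets_ms):
    """Return the label of the first bucket whose upper bound covers the latency."""
    for bound in buckets_ms:
        if latency_ms <= bound:
            return f"<={bound}ms"
    return f">{buckets_ms[-1]}ms"


def summarize(requests, window_seconds, buckets_ms=BUCKETS_MS):
    """Compute request rate, success rate, and a latency histogram for one window."""
    total = len(requests)
    successes = sum(1 for _, ok, _ in requests if ok)
    histogram = Counter(bucket_label(latency, buckets_ms) for _, _, latency in requests)
    return {
        "request_rate": total / window_seconds,  # requests per second
        "success_rate": successes / total if total else None,  # undefined with no traffic
        "latency_histogram": dict(histogram),
    }


summary = summarize(REQUESTS, window_seconds=3.0)
print(summary)
```

A histogram is used here rather than a single average because one slow request (the 510 ms entry in the sample data) would otherwise be hidden by the faster majority.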
Terran Melconian (Air Network Simulation and Analysis)
Terran Melconian explores an organized process for observing a misbehaving complex system, reasoning about possible causes, and isolating the fault. While it is not generally taught, all the successful senior engineers with operational experience Terran has talked to use a variant of this process.
Phil Lombardi (Datawire), Rafael Schloming (Datawire), Richard Li (Datawire)
Microservices are an increasingly popular approach to building cloud-native applications, and dozens of new technologies that streamline adopting microservices development, such as Docker, Kubernetes, and Envoy, have been released over the past few years. Phil Lombardi, Rafael Schloming, and Richard Li walk you through actually using these technologies to develop, deploy, and run microservices.
Dina Goldshtein (Riverbed)
Event Tracing for Windows (ETW) is the most important diagnostic tool Windows developers have at their disposal. Dina Goldshtein explores the rich and wonderful world of ETW events, which span numerous OS components. You’ll learn how to diagnose complex issues in production systems and discover ways to automate ETW collection and analysis to build self-diagnosing applications.
During active operational incidents, we experience very human reactions that get in the way of resolution. Approaches like Incident Command provide solid foundations for incident response. Kristopher Beevers explains how to augment Incident Command with simple tools and processes that help your team focus, communicate effectively, and respond calmly and precisely during mission-critical events.
Rob Dickinson (resurface.io)
On the surface, adapting software to use persistent memory seems obvious. After all, persistent memory is simply fast memory that maintains state when the power goes out, like an SSD. But unlike SSDs, persistent memory challenges long-held ideas and conventions about how software works. Rob Dickinson outlines four key ideas that will help focus your persistent memory strategy.
Julien Simon (AWS)
FPGAs have become a hot topic in the IT industry, thanks to the unprecedented computing power that they bring to demanding HPC applications, and AWS recently introduced FPGA-powered instances (aka F1 instances) to make the process simpler and quicker. Julien Simon walks you through building an FPGA-enabled application, from design to simulation to synthesis to execution on an F1 instance.
Bryan Liles (Heptio), Yuri Shkuro (Uber Technologies), Won Jun Jang (Uber), Prithvi Raj (Uber)
Yuri Shkuro, Bryan Liles, Won Jun Jang, and Prithvi Raj walk you through implementing distributed tracing in modern applications, using the CNCF’s OpenTracing project. You'll explore a set of sample applications and learn how to instrument them for tracing. You'll also use a tracing system such as Jaeger, Zipkin, or LightStep to visualize complex transactions that might span multiple processes.
Claire Le Goues (Carnegie Mellon University)
Claire Le Goues shares recent advances in academic software engineering and programming languages research that aims to bring that dream to reality, using everything from metaheuristic search to program synthesis to machine learning and search over big databases of existing code to make it happen.
Swaminathan Sundaramurthy (Salesforce Inc), Mark Cho (Pinterest)
Pinterest has to support real-time decision making while operating on petabyte-scale data. Swaminathan Sundaramurthy and Mark Cho offer an overview of Pinterest's real-time data pipeline (modeled on a quasi-Kappa architecture), its impact on the company's systems, and the tools and processes used. They also demonstrate how Pinterest models real-time ads analytics on the platform.
Matt Cutts (United States Digital Service (USDS))
In government, you can still find out-of-date tech practices like writing requirements for years or launching systems without monitoring. The government wants more effective technology. Meanwhile, everyone else wants a more effective government. Matt Cutts discusses how better technology can improve not just software systems but also trust in government itself.
Kelly Looney (Skytap)
Kelly Looney shares an incremental approach to introducing containers into complex, distributed applications—resulting in modernization with less risk and more reward. You’ll learn how to evaluate which components of your applications are best suited for containers, how to experiment safely and get fast feedback, and how to increase and scale your container adoption.
Susie Xia (LinkedIn), Anant Rao (LinkedIn)
Susie Xia and Anant Rao explain how LinkedIn leverages live production traffic to determine service and resource bottlenecks at scale with a tool called Redliner and how you can use your current architecture to do the same.
Baron Schwartz (VividCortex)
Observability (or lack thereof), like testability and maintainability, is a fundamental property of systems. But what does observable code look like? What instrumentation creates systems that are observable later in arbitrary ways, in circumstances you can't foresee? Baron Schwartz outlines the most useful things to know about observability in systems in production.
Sébastien Goasguen (TriggerMesh)
Kubernetes, one of the highest-velocity projects on GitHub, is quickly becoming the leading platform on which to build distributed applications. Sébastien Goasguen offers a Kubernetes primer, covering the architecture of a Kubernetes installation, the API objects that make up a distributed application on Kubernetes, and more.
Ignat Korchagin (Cloudflare)
Ever wondered how to quickly and efficiently roll over all of your servers’ SSH keys or how to securely manage diskless systems? Ignat Korchagin outlines a simple approach that combines hardware support and a little cryptography to help operationalize the management of all the secrets in your cloud.
Seth Vargo (Google)
It’s great that you’ve moved to microservices, but how are you distributing secrets? Seth Vargo offers an overview of Vault’s unique approach to secret management by providing secrets as a service for your services (and your humans too), which is highly scalable and easily customizable to fit any environment.
As the systems we build become more distributed and (in the case of containerization) ephemeral, traditional monitoring tools prove to be grossly insufficient. Fortunately, the state of monitoring has evolved to meet these new demands, but it brings its own set of technical and organizational challenges. Cindy Sridharan offers an honest overview of monitoring challenges and trade-offs.
Zhenzhong Xu (Netflix)
Keystone, a critical piece of Netflix's backend data infrastructure, ensures massive data movements and real-time event processing. Zhenzhong Xu leads a deep dive into Keystone's architecture and underlying stream processing engines, sharing insights and proven paths on how the company achieves multitenancy, scalability, and resilience in a complex cloud-native distributed system environment.
Arshan Dabirsiaghi (Contrast Security)
Arshan Dabirsiaghi explains what Contrast Security learned from the Struts 2 exploit and details how to stop the next attack against your production apps.
Karthik Kirupanithi (Amazon Web Services)
Voice UIs like Amazon's Alexa can make systems management simple, intuitive, and delightful. The virtual private assistant feel of a VUI, coupled with the abstraction that voice commands bring, breaks the tedium of management tasks. Karthik Kirupanithi demonstrates how to put together an Alexa skill that can perform tasks using the EC2 Systems Manager.
We like to think that technology can make the world a better place, but we (conveniently) forget how it can make it worse. Primum non nocere (first do no harm) is the first concept taught in medical school, serving as a reminder of the possible harm that any intervention might do. Cynthia Savard Saucier challenges the tech industry to come up with its own fundamental principle.
Lex Neva (Fastly)
When the DDoS attack crushed Dyn last October, did your DNS fail? Heroku's sure did. In response, Lex Neva deep dove into everything DNS to learn how to implement resilient DNS properly—reading RFCs, asking questions of pros, and performing real-world experiments when no one knew the answers. Join Lex to find out what does work and all the crazy details of DNS that he uncovered.
Kevin Beck (New Relic)
New Relic customers send monitoring data to New Relic servers every minute—a continuous firehose of data. Drawing on his experience at New Relic, Kevin Beck shares best practices for building a streaming service based on Apache Kafka, self-monitoring for reliability and fault tolerance, and building a DevOps culture that anticipates and prevents outages.
Carin Meier (Cognitect)
As technology advances, our systems are growing more and more complex, reaching the threshold of what we can handle and even comprehend. We need more than tools to keep it under control. We need new ways of thinking. Carin Meier explores new ways to approach systems and tame complexity for the rapidly changing future.
Joe Goldberg (BMC Software)
Business transformation has led us to adopt new technologies and process and cultural changes. How batch application automation is built, tested, and run must evolve to keep pace. Joe Goldberg explores jobs as code, which looks at batch application automation from an SDLC perspective—an approach that embeds expectations within a modern automation platform.