Finding the right balance between writing custom in-house software and using an off-the-shelf solution is difficult. Aish Raj Dahal sheds light on the age-old build-versus-buy problem and "not invented here syndrome" by explaining how PagerDuty built a distributed task scheduler and later moved off it to use an off-the-shelf open source solution.
Good test coverage is essential for catching issues before a pull request is merged, but the tests have to be the right kind and must be reliable. Drawing on his experience at Microsoft, Sam Guckenheimer details which types of tests to run in your DevOps pipeline, when to run them, and why.
Beyond looking out for a little green padlock in the browser bar, what do you need to know about secure connections as a programmer? What do people mean by terms like authentication, verifying a certificate, or signing a message? Join Liz Rice as she demystifies HTTPS, TLS, X.509, and more.
Ansible is a "batteries included" automation, configuration management, and orchestration tool that's fast to learn and flexible enough for any architecture. Join James Meickle to get started with Ansible, with an eye toward sustainable development in cloud environments.
Traditional security approaches to threat and risk management are highly optimized to work within a traditional software development lifecycle. Michael Brunton-Spall shares a new approach to reviewing systems along with real-life examples to help you prioritize where to focus security efforts and what sorts of security threats you should worry about.
Multiregion deployments can improve availability and latency and can cost way less than you think. Aaron Blohowiak dives into his experience operating in multiple regions at scale at Netflix and shares the algebraic models, code, and incident management playbooks the company has developed to tame, refine, and leverage its approach.
Accidental architecture is a product of circumstances rather than deliberate development toward a goal. James Thompson explains why it's best addressed by equipping teams to make more deliberate and informed technical decisions.
Implementing site reliability engineering (SRE) doesn't have to be intimidating, and it isn't only for cloud-native organizations. Liz Fong-Jones and Dave Rensin share eight key lessons Google's customer reliability engineering team learned helping large enterprises adopt SRE as an operations engineering model.
Mike Newswanger explains how he used Kubernetes and Google Cloud to burst and extend the capacity of a physical infrastructure to optimize almost 10 million images in less than two weeks.
Evolving teams and evolving companies are a constant in the career of a leader; helping your team navigate that change is critical to your success as a manager and to the success of the organization. Rocio Delgado shares dos and don'ts for managing and communicating change in your team or organization, which may highlight where your own skills need to evolve.
Neil Peterson leads a technical deep dive into using the Kubernetes Service Catalog to dynamically provision and consume managed cloud services.
As our industry faces its biggest reckoning ever with the social, ethical, and cultural impacts of technology, what can we learn if we reflect on the assumptions we build into our systems? How could our processes and tools be designed to undo the biggest bugs and biases of today’s tech?
Join this talk to learn how to curate your perfect developer experience using Kubernetes.
Waffle House's hurricane disaster plan has everything you could want from an IT disaster plan, including contact trees, failover states, and runbooks on partial operation. Heidi Waterhouse shares lessons about state drawn from the world outside computers and explains how to quantify them using a finite state machine and implement them automatically while you are in a less-than-perfect condition.
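To make the finite state machine idea concrete, here is a minimal sketch in Python. The state names, event names, and transition table are hypothetical and chosen only for illustration; they are not taken from Waffle House's plan or from the talk.

```python
from enum import Enum


class ServiceState(Enum):
    # Hypothetical operating states for a site with partial-operation runbooks.
    FULL_SERVICE = "full_service"
    LIMITED_MENU = "limited_menu"
    CLOSED = "closed"


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (ServiceState.FULL_SERVICE, "power_lost"): ServiceState.LIMITED_MENU,
    (ServiceState.LIMITED_MENU, "power_restored"): ServiceState.FULL_SERVICE,
    (ServiceState.FULL_SERVICE, "evacuate"): ServiceState.CLOSED,
    (ServiceState.LIMITED_MENU, "evacuate"): ServiceState.CLOSED,
    (ServiceState.CLOSED, "reopen"): ServiceState.LIMITED_MENU,
}


def next_state(state: ServiceState, event: str) -> ServiceState:
    """Return the state reached from `state` on `event`, or raise if undefined."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


def run(events, start=ServiceState.FULL_SERVICE) -> ServiceState:
    """Apply a sequence of events in order and return the final state."""
    state = start
    for event in events:
        state = next_state(state, event)
    return state
```

Because every allowed move is written down in one table, an undefined combination fails loudly instead of leaving the system in an unknown condition.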
Molly Crowther demonstrates how the enterprise can use cloud platforms to make security move at the pace of business—not the other way around.
Many companies adopt microservices to break down monoliths, but they soon uncover a hidden cost: How do you manage all these new interconnected things popping up? Michael Hamrah explains how to avoid creating Frankenstein's monster by understanding elements of a microservice platform...so you can sleep at night.
Three years ago, technical teams at USA TODAY NETWORK were completely siloed, which made improvements and troubleshooting difficult and left each team often blind to the rest of the technical organization. Bridget Lane and Kris Vincent explain how drastically the teams' tool belts, thought processes, and goals have changed as the company moved from silos to a single pane of glass.
In the past five years, Alexander Rasmussen has spent a lot of time trying to get high-integrity data out of spreadsheets and into databases. Alexander explores common data integrity problems when dealing with spreadsheet data, investigates whether those integrity problems are inescapable, and shares ongoing work to mitigate them.
Matt Rogish explains how NTSB investigations of air disasters have dramatically improved flight safety and applies lessons learned in disaster recovery and analysis, teamwork, task saturation, and systems design to modern software application and infrastructure architecture at scale to achieve higher availability, reduced errors, and more scalable systems.
You're unsatisfied with one of your monitoring providers. You've considered finding a new solution, but the thought of migrating your data off their platform sounds extremely painful. Amy Nguyen and Cory Watson explain how to meet a deadline for an infrastructure-critical software migration while ensuring that everyone's requirements are met and no data is lost.
How do you refactor major, core functionality in a million-line codebase without disrupting the entire system? Maude Lemaire explains how Slack overhauled channels and shares the many obstacles the company overcame to boost both application performance and company-wide developer productivity (with only a few hiccups).
Developer and operator personas are often viewed as separate, but the truth on the ground is actually far more mixed. Developers often operate their own software, and operators often explore software to find and fix bugs. Brendan Burns covers this overlap, explaining how to build tooling and approaches that enable developers and operators to quickly switch or blend between the personas.
It's a Kubernetes world. Join Ryan Gregg to learn about Knative, an open source collaboration between Google and other industry leaders to define the future of serverless on Kubernetes. Knative solves the difficult but boring aspects of running modern cloud applications on Kubernetes.
Kubernetes has a reputation for being complex to set up and operate, but that doesn't have to be the case. Join Jérôme Petazzoni to explore Kubernetes concepts and architecture and learn how to use it to deploy and scale your applications. The content is suitable for all kinds of deployment models, from the cloud (AKS, EKS, GKE, kops, etc.) to on-premises.
As Kubernetes enters the mainstream market, we are seeing more use cases that don't fit the original mold, each bringing a new set of challenges. Ian Crosby discusses three specific case studies, the challenges encountered adopting Kubernetes, and the solutions and tooling used to solve them.
Christian Monaghan explains how he and his team successfully migrated HealthCare.gov, America's largest government website, to the cloud infrastructure provisioning tool Terraform, shares lessons learned along the way, and details how you can effectively use Terraform for your next project.
When building distributed applications, it's highly desirable to maintain a single source of truth, such as a database, for all application state. Unfortunately, for some applications, multiple sources of truth are unavoidable. Adam Wolfe Gordon shares strategies, learned from real-world experience, for managing multiple sources of truth without sacrificing consistency and usability.
Spotify recently completed the migration of all services from bare-metal hardware to cloud hosts on GCP. The company is now moving from merely cloud hosted to cloud native by migrating its services to Kubernetes. James Wen discusses the work involved, lessons learned, and pitfalls encountered in moving services onto Kubernetes.
Machine learning has revolutionized many fields, from cancer detection to self-driving cars. And let's not forget about connected toilets that allow Alexa to flush at your command. Francesc Campoy Flores explores some of the techniques used and the most relevant research, focusing on use cases where machine learning can help developers be more efficient.
In 2007, Pat Helland published "Life Beyond Distributed Transactions: An Apostate’s Opinion," in which he conducts a thought experiment on how to design a distributed database that can scale almost infinitely. While the paper explicitly addresses distributed database design, Sean Allen shows that the ideas are far more widely applicable, particularly in scaling stateful applications.
Automated anomaly detection in production using simple data science techniques enables you to more quickly identify an issue and reduce the time it takes to get customers out of an outage. Tuli Nivas shows how to apply simple statistics to change how performance data is viewed and how to easily and effectively identify issues in production.
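As one illustration of what "simple statistics" can mean here, the sketch below flags samples that fall more than a chosen number of standard deviations from a baseline mean. This is an assumed, generic z-score check, not necessarily the technique the session presents; the function name, the threshold of 3, and the sample latencies are hypothetical.

```python
import statistics


def flag_anomalies(baseline, samples, threshold=3.0):
    """Return indices of samples farther than `threshold` standard
    deviations from the mean of `baseline`.

    `baseline` should hold measurements from a period known to be healthy.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    if stdev == 0:
        raise ValueError("baseline has no variation; cannot compute a z-score")
    return [
        index
        for index, value in enumerate(samples)
        if abs(value - mean) > threshold * stdev
    ]


# Hypothetical response times in milliseconds.
healthy = [100, 102, 98, 101, 99, 100, 103, 97]
recent = [101, 99, 140, 100, 93]
suspect_indices = flag_anomalies(healthy, recent)
```

A fixed threshold like this is easy to reason about but assumes the healthy data is roughly stable; metrics with daily or weekly cycles would need a baseline taken from a comparable time window.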
Performance theory offers a rigorous and practical approach to performance tuning and capacity planning. Kavya Joshi dives into elegant results like Little’s law and the Universal Scalability Law. You'll also discover how performance theory is used in real systems at companies like Facebook and learn how to leverage it to prepare your systems for flux and scale.
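The two results named above can be written down in a few lines. Little's law states that the average number of requests in a system equals the arrival rate multiplied by the average time each request spends in the system. The Universal Scalability Law models relative capacity at N workers as N / (1 + α(N − 1) + βN(N − 1)), where α represents contention and β represents coherency cost. The following is a minimal sketch; the function names and the example values are assumptions for illustration only.

```python
def littles_law_in_flight(arrival_rate: float, avg_time_in_system: float) -> float:
    """Little's law: L = lambda * W.

    arrival_rate is in requests per second and avg_time_in_system in seconds,
    so the result is the average number of requests in the system.
    """
    return arrival_rate * avg_time_in_system


def usl_relative_capacity(n: int, alpha: float, beta: float) -> float:
    """Universal Scalability Law: capacity at n workers relative to one worker.

    alpha models contention (serialized work) and beta models coherency
    (crosstalk) cost. With alpha = beta = 0 the scaling is perfectly linear.
    """
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


# Hypothetical numbers: 200 requests/second, 50 ms average latency.
in_flight = littles_law_in_flight(200, 0.05)

# Hypothetical coefficients: capacity rises, peaks, then falls as n grows.
capacities = {n: usl_relative_capacity(n, alpha=0.05, beta=0.001) for n in (10, 31, 60)}
```

With a nonzero β, adding workers past the peak reduces total capacity, which is why the model is useful for deciding where scaling out stops paying off.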
Rewriting the key software component of your platform from scratch is always intimidating. Shannon Weyrick and James Royalty discuss NS1's recent DNS server rewrite and outline the steps the company took to roll it out across its globally distributed network with no downtime.
Now that adoption is ramped up and HTTP/2 is being regularly used on the internet, it's a good time to revisit the protocol and its deployment. Hooman Beheshti reviews protocol basics and digs into core features such as interaction with TCP, server push, priorities and dependencies, and HPACK.
Quantopian integrates financial data from vendors around the globe. As the scope of its operations outgrew cron, the company turned to Apache Airflow, a distributed scheduler and task executor. James Meickle explains how, in less than six months, Quantopian was able to rearchitect brittle crontabs into resilient, recoverable pipelines defined in code to which anyone could contribute.
Serverless architectures remove load from web servers and scale flawlessly to handle any volume while keeping you from paying for an instant of wasted idle time. Bill Boulden walks you through creating a functioning serverless API that coexists alongside conventionally served web pages using AWS Lambda and API Gateway.
Jamie Wilkinson offers a brief overview of SLOs, shares a practical guide to implementing sustainable SLO-based alerting for systems of any size, and outlines the tooling required to supplement the system in the absence of cause-based alerting.
Effie Mouzeli explains why small-scale engineering is just as challenging as large-scale engineering and offers ideas on how to survive technical debt, poor communication, and other everyday challenges.
Over the past year, service meshes have gained significant interest. Most service meshes have two components: a control plane and a data plane. Anubhav Mishra explains what it takes to build a scalable control and data plane. Anubhav also discusses how HashiCorp Consul provides many features like a distributed key-value store and service discovery that make it ideal for a control plane.
Slack’s rapid growth over the last few years outpaced the original database’s scaling capacity, which negatively impacted the company's customers and engineers. Ameet Kotian explains how a small team of engineers embarked on a journey for the right database solution, which eventually led them to Vitess, an open source cluster database.
Technical interviewing is profoundly important, but unfortunately, it's easy to do poorly and very difficult to do well. Moishe Lettvin outlines strategies for reducing bias and increasing the fidelity of your technical interviews.
The Financial Times recently migrated its content platform to Kubernetes. Join Sarah Wells to find out what it takes to migrate 150+ microservices from one container stack to another without affecting the existing production users and while the rest of your teams are working on delivering new functionality.
In critical path services such as DNS, stability is imperative above all else. Kris Beevers examines the trade-offs between risk and velocity faced by any high-growth, critical path technology business.
Getting traffic into a Kubernetes cluster should be simple, but it’s not. The range of options can be confusing, and implementing effective configuration is equally challenging. Richard Li discusses the evolution of ingress on Kubernetes, explains why ingress controllers aren’t necessarily the best approach, and shares a series of lessons learned about managing traffic ingress.
Priyanka Sharma and Yuri Shkuro demonstrate how distributed tracing works and how to employ it in the development and operations of your applications in the programming language of your choice: Java, Go, Python, Node.js, C#, or C++.
Michael Hausenblas walks you through troubleshooting applications running in Kubernetes, from application-level debugging to distributed tracing to chaos engineering.
Naoman Abbas offers an overview of tools Pinterest built to process trace data and the use cases they’ve enabled and shares some real-world examples. Join in to learn how to apply these techniques to your own challenges.
Preetha Appan outlines various failure modes ranging from network failures to entire server failures in Nomad, an open source scheduler that supports heterogeneous workloads.
Aviran Mordo discusses the challenges and real-life use cases of handling data in a distributed environment.
Getting Kubernetes up and running is only half the battle. Now you need to get the supporting infrastructure in place. Dan Mennell shares a templated approach to deploying what is needed to get started with source control, CI/CD, and monitoring with Prometheus, among other things.