Jessica DeVita tells the story of how a team at Microsoft challenged themselves to retrospect their retrospectives and shares what they learned about applying human factors ideas to software development.
Artificial intelligence has been almost here for 50 years, but we don't need to wait for it to escape the laboratory. Adding a manageable dose of actionable intelligence to your operations management workflow can save you time and aggravation. PagerDuty will talk about AI's limitations and how it can decrease your noise and suggest possible courses of action.
Declarative application management enables developers and operators to simplify their configurations while deploying into increasingly complex environments. Bryan Liles explains how to evaluate and integrate these new practices into existing continuous integration pipelines.
Many fundamental security practices and controls apply to serverless applications, including implementing proper monitoring and logging of all requests and events. Luis Eduardo Colon explores recommendations published by the Center for Internet Security (CIS), explains how to automate the deployment of some of these controls, and outlines considerations relevant to serverless functions.
What insights do we gain if we apply user experience design to information security? Serena Chen shares four strategies that apply design thinking to security problems, pinpointing which practices work and which are detrimental. Serena then walks you through some common flows and dissects how design decisions affect your personal security.
Change is inevitable, but the aftereffects can be both good and bad. Having the right tools is one way to meet this challenge. Dave Andrews explains how to wield the power of a global 50 Tbps application delivery network, featuring 125+ points of presence, to ensure maximum availability during and after a change.
Ben Hartshorne and Christine Yen explore what it means for a system to be “up” by discussing end-to-end (e2e) checks: what makes a good one and which techniques are valuable when thinking about them. Along the way, you'll learn how to write and evolve an e2e check against a common API.
Engineering teams want technically competent managers, but they also often want managers to keep their hands off their code. So how can managers keep their technical skills relevant in order to add the most value? Kathleen Vignos shares creative strategies for developing and maintaining technical skills—some through the act of managing itself.
In 2016, Slack faced a problem: the load on its backend servers had increased by 1,000x. Bing Wei explains how rearchitecting the system with lazy loading, a publish/subscribe model, and an edge cache service overcame the problem with zero downtime, improved latency, and led to gains in reliability and availability.
Christian Saide explains how NS1 was able to reduce infrastructure, maintenance, and operational costs while simultaneously increasing throughput and visibility of key metrics by leveraging Elasticsearch as a time series database.
Baron Schwartz demonstrates how to monitor a database by understanding the difference between workload and resource monitoring—and the golden signals for each.
The Beacon Network is the largest search and discovery engine of human genomic data in the world. Miro Cupak details the architecture and technologies behind the system, with a focus on the technical decisions that allow it to scale and disrupt the perception of genetic data.
Kyle Kingsbury offers an overview of Tesser, a Clojure library for writing commutative, parallel folds that can be chained and composed into complex single-pass reductions that are dramatically faster on multicore systems and can be transparently distributed over Hadoop.
Join Nathen Harvey to learn how to easily integrate automated tests that check for adherence to policy into any stage of your deployment pipeline, using InSpec for compliance and Chef for remediation.
Kyle Kingsbury explores anomalies in three distributed systems—Tendermint, Hazelcast, and Aerospike—and shares general strategies for correctness testing using Jepsen, a distributed system testing harness that applies property-based testing to databases to verify their correctness claims during common failure modes: network partitions, process crashes, and clock skew.
Jeff Williams explains how to layer security tools on a CI/CD pipeline without disrupting it and demonstrates a fast, effective, scalable DevSecOps pipeline using free tools.
Ian Lewis shares the easiest and best ways to improve the security of your Kubernetes clusters.
When Tamar Bercovici joined Box, the entire platform was running on a single MySQL DB host fronted by a simple pool of memcached servers. Tamar details how the team has evolved the Box database stack to handle an ever-growing query load and dataset. It now comprises hundreds of servers serving millions of queries per second over hundreds of billions of data records.
Matt Torrisi demonstrates how to build domain traffic easily by enabling multiplatform DNS, covers the important criteria in assessing DNS network compatibility, and walks you through using DNS as a traffic-steering platform.
John LaBarge details how to perform lightweight mobile DevOps on GCP, including building Android applications with Container Builder, doing functional testing with Firebase Device Lab, and distributing tested artifacts through Crashlytics Beta.
Tim Koopmans explains how load testing is being reinvented for DevOps, covering where traditional load testing approaches fall short for Agile and DevOps, what’s needed to rapidly expose performance issues before they impact users, and new approaches to making load testing faster, simpler, and more realistic.
Networking with Docker and Kubernetes is a lot more complex than with traditional servers and virtual machines. Jeff Poole offers an overview of the concepts involved and explains what tuning may be required to use Kubernetes successfully.
Tomas Lin and Emily Burns walk you through building continuous delivery pipelines for deploying and promoting code across cloud virtual machines and containers using Netflix's Spinnaker continuous delivery platform.
Kyle York and Richard Lee explore Netra’s high-performance computing environment, focusing on how the company's AI and deep learning models process tens of millions of images and videos each day in a time- and cost-effective manner. Along the way, they explain what worked, what didn't, and why you need an Agile, hybrid infrastructure if you want to build an AI business at the scale of social.
Engineering managers build the strongest teams by listening to their engineers, continuously calibrating their own alerts, and driving change management based on the feedback sourced from within their engineering organization. Renee Orser explains how to monitor the human networks within your engineering teams using models similar to your distributed technology systems.
In 2007, Pat Helland published "Life Beyond Distributed Transactions: An Apostate’s Opinion," in which he conducts a thought experiment on how to design a distributed database that can scale almost infinitely. While the paper explicitly addresses distributed database design, Sean Allen shows that the ideas are far more widely applicable, particularly in scaling stateful applications.
Performance debugging is a crucial part of ensuring code is production ready, particularly as a company and its products grow. However, bottlenecks that hold these services back can be hard to identify. Christian Grabowski shares his experience debugging bottlenecks in distributed systems, at both a macro (metrics, distributed tracing) and a micro (user space and kernel space profiling) level.
Rewriting the key software component of your platform from scratch is always intimidating, especially when you guarantee 100% uptime, your platform is in the critical application delivery path, and your environment is highly distributed. Shannon Weyrick discusses NS1's recent DNS server rewrite and the steps the company took to roll it out across its globally distributed network with no downtime.
Astrid Atkinson discusses techniques for building systems that are resilient by design.
Kris Nova explores the current state of running stateful applications in Kubernetes, the tooling gaps you'll want to watch out for, and the four metrics that will help you determine if it's worth the risk.
When Julia Grace joined Slack two-and-a-half years ago, the company had fewer than 100 engineers. It's now at more than 350, and her own team grew from 10 to 50 people in 18 months. Julia shares tips and stories from the leadership front lines as she learned how to rapidly scale herself and her leadership team during a period when her job was substantially changing every six months.
Nicole Forsgren shares results and stories from four years of research to uncover the secrets and surprises of what really makes high-performing technology-driven teams and organizations.
Marcel Flores explores the design and implementation of Heteractis, the traffic management system Verizon Digital Media Services uses to turn network telemetry data into automated decisions.
Manish Mehta and Torin Sandall lead a deep dive into how Netflix enforces authorization policies (“who can do what”) at scale in its microservices ecosystem in a public cloud without introducing unreasonable latency in the request path.
When the internet is not bombarding your DNS with bogus requests, it’s trying to execute malicious SQL queries and crawling your site with bots (some good, some bad). Join Kyle York to learn how to take action.
We are more mobile now than ever. Although we use our mobile devices to optimize our time and do more anytime, anywhere, our apps are still too slow and cannot cope with our fast-paced lifestyle. Javier Garza details the ingredients you need to build and deliver an amazing app your users will love.
Paul McCallick discusses how and why Nordstrom has moved to an only-production viewpoint, saving countless engineering cycles and putting effort where it matters.
Tooling is necessary for serverless and service-full applications. Donna Malayeri shares a decision framework for choosing infrastructure deployment tools, based on whether you need flexibility and control or simplicity and ease-of-use. You'll learn how to evaluate several popular cloud automation tools, including AWS SAM, Terraform, Chalice, Serverless Framework, and more.
Jasmin Nakic and Jackie Chu share techniques to identify performance challenges by analyzing production data from Salesforce and other sources and explore the AI models used to predict trends, detect anomalies, and troubleshoot performance problems.
Martin Woodward leads a whistle-stop tour of Microsoft's seven-year DevOps journey, explaining why the company embarked on this transformation and what benefits it has already seen.