GitHub's CTO discusses recent availability incidents, highlighting the need for transparency, better traffic management, and the balance between reliability and security.
The blog post discusses two recent outages experienced by Cloudflare, focusing on the complexities and risks involved in their system changes aimed at enhancing security. The author explains how a change intended to protect agains...
The author reflects on the recent Cloudflare outage while hosting a conference, emphasizing the importance of understanding complex system failures, particularly the concept of saturation in resilience engineering. The post discus...
The author argues that the term 'Correction of Error' misrepresents incidents by implying that they are caused solely by errors, which limits the potential for learning and improvement.
Complexity in engineering and aviation is both a challenge and a necessity, as it often introduces new problems while also providing solutions to existing ones.
The post critiques the Schott Review of the Optus Triple Zero outage, arguing it overlooks systemic issues and fails to understand the complexities of operational work.
The blog post discusses the unpredictability of the time-to-resolution (TTR) metric in incident response, using Cloudflare's incident data as a case study. The author argues that TTR will never be under statistical control due to ...
The essay explores how narratives, like Arendt's on Eichmann and Lewis's on Bankman-Fried, can clash with societal beliefs, leading to public rejection of complex truths.
A talk by AWS's Craig Howard on a recent incident reveals the complexities of reliability mechanisms and emphasizes the need for better incident response strategies and resilience engineering.
The expired SSL certificate incident for Bazel illustrates the critical risks and operational challenges associated with SSL technology and automated renewal systems.
The post compares Peter Drucker's practical management methods with W. Edwards Deming's systemic approach, arguing that Drucker's ideas are more widely adopted in the U.S. despite Deming's deeper insights.
The post examines the complexities of technology and the widespread ignorance about how systems function, emphasizing the risks of building without understanding.
The post examines scaling strategies in software operations and the impact of AI coding agents on productivity and human coordination in programming.
Cloudflare's February 20 incident stemmed from a bug in an automation process, leading to unintended deletion of active prefixes and highlighting operational risks.
AI companies are facing reliability issues due to rapid user growth, necessitating improvements in capacity management and service resilience.
AI SRE tools excel in diagnostics but fall short in incident management, highlighting the need for human coordination and diverse perspectives in resolving software incidents.
The post connects the contributions of James Moylan and James Reason to human error and information processing, highlighting the relevance of Rasmussen's SRK model in contemporary dashboard design.
The post analyzes Cloudflare's public incident data, revealing an average of nearly two incidents reported per day from January 1 to November 27, 2025. It contrasts this with AWS, which reports significantly fewer incidents, sugge...
Normal variability in human performance is essential for safety and adaptability, contrary to traditional views that treat it as a liability; the point is especially relevant to the use of AI in software development.
Dworkin's essay explores how anxiety can manifest as a loss of intuition in high-stakes medical decision-making, ultimately leading to a deeper understanding of both concepts.
Coordination costs in large organizations lead to inefficiencies, such as silos and excessive meetings, necessitating better management of collaborative efforts.
The blog post predicts that Verizon's outage will be linked to planned maintenance and human error, drawing parallels with previous telecom outages.
Waymo's robotaxis struggled during a power outage, revealing the challenges of system saturation and the need for improved safety protocols in autonomous navigation.
Introducing a learning potential rating alongside severity ratings can enhance incident management by prioritizing insightful analyses of lower severity incidents.
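
As a rough illustration of the learning potential idea in the last entry, the sketch below records a second rating next to severity and uses it to choose which incidents get an in-depth analysis. Everything here is an assumption made for illustration: the `Incident` class, the `review_queue` function, the 1-4 severity scale, the 1-3 learning potential scale, and the sample incidents are all hypothetical and do not come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    title: str
    severity: int            # assumed scale: 1 = most severe, 4 = least severe
    learning_potential: int  # assumed scale: 1 = low, 3 = high


def review_queue(incidents, max_reviews):
    """Pick incidents for in-depth analysis.

    Ranks by learning potential first (highest first), then by severity
    (most severe first) as a tie-breaker. This ordering is one possible
    policy, not a prescribed one.
    """
    ranked = sorted(incidents, key=lambda i: (-i.learning_potential, i.severity))
    return ranked[:max_reviews]


# Hypothetical sample data.
incidents = [
    Incident("Checkout outage", severity=1, learning_potential=2),
    Incident("Stale dashboard data", severity=4, learning_potential=3),
    Incident("Single host disk full", severity=3, learning_potential=1),
]

selected = review_queue(incidents, max_reviews=2)
for incident in selected:
    print(incident.title, incident.severity, incident.learning_potential)
```

With this policy a low-severity incident that is rated as highly instructive can still be selected for analysis ahead of a more severe but less informative one, which is the kind of prioritization the entry describes.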