Lorin Hochstein

About:

Lorin Hochstein's ramblings about software, complex systems, and incidents.

Incoming Links:

Ahmet Alp Balkan, Alex Weisberger, Conner McCall, Evan Smith, Guilherme Ananias, jerlendds, Monkeynoodle.Org, Philip Zucker

Outgoing Links:

Gwern Branwen, Hillel Wayne, Joel Spolsky, Kent Beck, Murat Buffalo, Nat Bennett, Patrick McKenzie, ribbonfarm

2026-03-13 • reliability engineering availability github cto incident response

GitHub's CTO discusses recent availability incidents, highlighting the need for transparency, better traffic management, and the balance between reliability and security.

2025-12-06 • cybersecurity reliability cloudflare services outage notifications complex adaptive systems

The blog post discusses two recent outages experienced by Cloudflare, focusing on the complexities and risks involved in their system changes aimed at enhancing security. The author explains how a change intended to protect agains...

2025-11-27 • software development cloudflare services incident management outage notifications resilience

The author reflects on the recent Cloudflare outage while hosting a conference, emphasizing the importance of understanding complex system failures, particularly the concept of saturation in resilience engineering. The post discus...

2025-12-21 • aws reliability error correction software engineering incident reports

The author argues that the term 'Correction of Error' misrepresents incidents by implying they are solely caused by errors, limiting the potential for learning and improvement.

2026-02-01 • software development aviation safety challenges ai and it tools complex adaptive systems

Complexity in engineering and aviation is both a challenge and a necessity, as it often introduces new problems while also providing solutions to existing ones.

2025-12-21 • nokia 3310 telecommunications incident reports optus triple zero

The post critiques the Schott Review of the Optus Triple Zero outage, arguing it overlooks systemic issues and fails to understand the complexities of operational work.

2025-11-28 • data analysis cloudflare services tabletop rpg incident response statistical control

The blog post discusses the unpredictability of the time-to-resolution (TTR) metric in incident response, using Cloudflare's incident data as a case study. The author argues that TTR will never be under statistical control due to ...

2026-01-17 • philosophy public opinion storytelling narrative cultural criticism

The essay explores how narratives, like Arendt's on Eichmann and Lewis's on Bankman-Fried, can clash with societal beliefs, leading to public rejection of complex truths.

2025-12-15 • software development aws reliability incident management resilience

A talk by AWS's Craig Howard on a recent incident reveals the complexities of reliability mechanisms and emphasizes the need for better incident response strategies and resilience engineering.

2025-12-27 • automation software development certificate management tls bazel

The expired SSL certificate incident for Bazel illustrates the critical risks and operational challenges associated with SSL technology and automated renewal systems.

2026-02-16 • leadership okr w. edwards deming measurement theory, management theory drucker

The post compares Peter Drucker's practical management methods with W. Edwards Deming's systemic approach, arguing that Drucker's ideas are more widely adopted in the U.S. despite Deming's deeper insights.

2026-02-09 • complexity telecommunications software engineering ai knowledge gap

The post examines the complexities of technology and the widespread ignorance about how systems function, emphasizing the risks of building without understanding.

2026-01-19 • programming parallel computing scalability software engineering ai

The post examines scaling strategies in software operations and the impact of AI coding agents on productivity and human coordination in programming.

2026-02-22 • automation api reliability cloudflare services incident management

Cloudflare's February 20 incident stemmed from a bug in an automation process, leading to unintended deletion of active prefixes and highlighting operational risks.

2026-03-07 • openai anthropic reliability large language models ai

AI companies are facing reliability issues due to rapid user growth, necessitating improvements in capacity management and service resilience.

2026-02-14 • software development team management and development site reliability engineering incident response ai

AI SRE tools excel in diagnostics but fall short in incident management, highlighting the need for human coordination and diverse perspectives in resolving software incidents.

2026-01-18 • human factors dashboard information processing moylan arrow srk model

The post connects the contributions of James Moylan and James Reason to human error and information processing, highlighting the relevance of Rasmussen's SRK model in contemporary dashboard design.

2025-11-28 • technical issues aws technical analysis cloudflare services incident management

The post analyzes Cloudflare's public incident data, revealing an average of nearly two incidents reported per day from January 1 to November 27, 2025. It contrasts this with AWS, which reports significantly fewer incidents, sugge...

2026-02-07 • software development healthcare variability safety ai

Normal variability in human performance is essential for safety and adaptability, contrasting with traditional views that see it as a liability, especially in the context of AI in software development.

2025-12-28 • decision making anxiety intuition psychology anesthesia

Dworkin's essay explores how anxiety can manifest as a loss of intuition in high-stakes medical decision-making, ultimately leading to a deeper understanding of both concepts.

2026-01-25 • management meetings workplace culture coordination efficiency metrics

Coordination costs in large organizations lead to inefficiencies, such as silos and excessive meetings, necessitating better management of collaborative efforts.

2026-01-16 • risk management outage notifications telecommunications network management verizon, at&t

The blog post predicts that Verizon's outage will be linked to planned maintenance and human error, drawing parallels with previous telecom outages.

2025-12-24 • waymo autonomous vehicles saturation robotaxi safety

Waymo's robotaxis struggled during a power outage, revealing the challenges of system saturation and the need for improved safety protocols in autonomous navigation.

2025-12-23 • difficulty incident management incident response learning potential

Introducing a learning potential rating alongside severity ratings can enhance incident management by prioritizing insightful analyses of lower severity incidents.