Wolf Incident Postmortem
post by jefftk (jkaufman) · 2023-01-09T03:20:03.723Z · LW · GW · 13 comments
Incident #210
Status
Complete, one action item outstanding.
Summary
Sentinel consumed by wolf after repeated false alarms.
Impact
Loss of sentinel. No flock impact.
Root causes
Sentinel generated noisy alerts due to premature deployment, incomplete training, and overly monotonous task. Oncalls failed to respond to true positive due to alert fatigue.
Trigger
Wolf.
Resolution
Gathered flock. Deployed replacement sentinel.
Detection
Sentinel did not report at end of shift.
Action Items
Priority | Action Item | Type | Status |
---|---|---|---|
P0 | Gather flock | mitigate | complete |
P0 | Deploy replacement sentinel | mitigate | complete |
P1 | Update playbook for wolf alerts | prevent | complete |
P2 | Update remaining sentinels | prevent | complete |
P2 | Revise sentinel training program | prevent | complete |
P2 | Investigate equipping sentinels with flutes or slings | prevent | in progress |
Lessons Learned
What went well
- Flock gathering proceeded without issues.
- No flock injuries or losses.
- Replacement sentinel did not exhibit false positive alerts.
What went wrong
- Noisy alerts not addressed.
- Alerts silenced contrary to playbook.
- Loss of sentinel.
Where we got lucky
- Only one wolf.
- Wolf sated after sentinel consumption.
- Replacement sentinel available.
Timeline
All times local
March 3rd:
- 16:32 Oncalls paged "wolf".
- 16:34 First oncall arrives at sentinel location.
- 16:34 Alert diagnosed as false positive. No corrective action performed.
March 4th:
- 14:15 Oncalls paged "wolf".
- 14:19 First oncall arrives at sentinel location.
- 14:19 Alert diagnosed as false positive. No corrective action performed.
March 5th:
- 17:03 (Reconstructed) Outage begins, sentinel notices wolf.
- 17:03 Oncalls paged "wolf".
- 17:04 Oncalls paged "wolf".
- 17:04 Oncalls paged "real wolf".
- 17:05 (Reconstructed) Wolf consumes sentinel.
- 18:45 Sentinel does not report at end of shift.
- 19:05 Primary oncall dispatched to field.
- 19:10 Oncall diagnoses issue.
- 19:10 Incident begins, secondary and tertiary oncalls paged.
- 19:15 First sheep located.
- 19:52 Last sheep located.
- 20:05 Flock safe in pens.
- 20:05 Outage ends, flock protection fully restored.
- 20:45 Replacement sentinel identified.
March 6th:
- 07:38 Replacement sentinel deployed.
- 18:45 Replacement sentinel reports at end of shift.
- 18:45 Incident ends, 24hr without wolf alerts or activity (exit criterion).
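The key intervals in the timeline can be worked out from the timestamps above. The following is a minimal sketch, not part of the original report: the year is an arbitrary assumption (the report gives none), the helper `at` is hypothetical, and the final entries are assumed to fall on March 6th, the day after the outage.

```python
from datetime import datetime, timedelta

YEAR = 2023  # assumption: the report does not state a year


def at(day: int, hhmm: str) -> datetime:
    """Hypothetical helper: build a local timestamp for a day in March."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(YEAR, 3, day, hour, minute)


outage_start = at(5, "17:03")    # sentinel notices wolf (reconstructed)
detected = at(5, "18:45")        # sentinel does not report at end of shift
incident_start = at(5, "19:10")  # oncall diagnoses issue, incident begins
outage_end = at(5, "20:05")      # flock safe in pens
incident_end = at(6, "18:45")    # assumed March 6th: exit criterion met

time_to_detect = detected - outage_start
outage_duration = outage_end - outage_start
incident_duration = incident_end - incident_start

print(f"time to detect:    {time_to_detect}")
print(f"outage duration:   {outage_duration}")
print(f"incident duration: {incident_duration}")
```

Under these assumptions, detection lagged the start of the outage by 1 hour 42 minutes, the outage lasted 3 hours 2 minutes, and the incident stayed open for 23 hours 35 minutes.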
13 comments
comment by jimrandomh · 2023-01-09T16:45:13.854Z · LW(p) · GW(p)
This historical incident report fails to mention the true root cause, which has since been addressed: Wolves were not yet locally driven to extinction.
↑ comment by jefftk (jkaufman) · 2023-01-09T17:46:41.874Z · LW(p) · GW(p)
I thought the true root cause was that people were still raising animals for human consumption?
↑ comment by the gears to ascension (lahwran) · 2023-01-10T02:59:35.014Z · LW(p) · GW(p)
in sufficiently complex systems, there is often no single root cause, even if you could see the entire causal graph. This seems like a case where that applies to me.
... but also, what jeff said
comment by DragonGod · 2023-01-09T14:23:42.016Z · LW(p) · GW(p)
I don't understand what I just read.
↑ comment by swarriner · 2023-01-09T14:32:17.287Z · LW(p) · GW(p)
It's the "Boy who cried wolf" fable in the format of an incident report such as what might be written in the wake of an industrial disaster. Whether the fictional report writer has learned the right lessons I suppose is an exercise left for the reader.
comment by FeepingCreature · 2023-01-09T16:36:46.475Z · LW(p) · GW(p)
See also: Swiss cheese model
tl;dr: don't overanalyze the final cause of disaster; usually it was preceded by serial failure of prevention mechanisms, any one or all of which can be improved for risk reduction.
↑ comment by [deleted] · 2023-01-10T01:56:02.109Z · LW(p) · GW(p)
Yeah but false positive. Every time anyone mentions all the ignored warnings they never try to calculate how many times the same warning occurred and everything was fine?
It's easy to point to O rings after the space shuttle is lost. But how many thousand other weak links were NASA/contractor engineers concerned about?
comment by Metacelsus · 2023-01-10T23:39:32.126Z · LW(p) · GW(p)
OK, but why equip sentinels with flutes?
↑ comment by jefftk (jkaufman) · 2023-01-11T00:15:14.791Z · LW(p) · GW(p)
To make the task less monotonous. This is also a major benefit of slings.
comment by Oliver Sourbut · 2023-01-16T21:53:14.765Z · LW(p) · GW(p)
Oh boy, this is terrifyingly familiar from my oncall days!
comment by greylag · 2023-01-09T07:01:53.119Z · LW(p) · GW(p)
(Epistemic status: lyrics)
I’m not too clear about what you just spoke. Is that a parable, or a very subtle joke?
↑ comment by aphyer · 2023-01-09T14:16:47.853Z · LW(p) · GW(p)
If you're making false claims of your incomprehension, it's clear that you've missed the moral dimension. When you truly can't get what someone is saying, remember today and the games you were playing. It takes people effort to give added proof...and they won't put that in for the boy who cries wolf.