In practice, ‘alerts’ can have different meanings in different organizations

September 2, 2023

One of the things I’ve become more and more aware of over time as
I talk about our metrics, monitoring, and alerting system is that what ‘alerts’ are can vary
quite a bit between environments, despite everyone using the same
term and often the same technology to implement their particular
form of ‘alerts’. Some of the difference in what alerts mean is
technological and some of it is organizational (or ‘operational’).

The first big difference, which is partly technology (in how alerts
are delivered) and partly organizational, is whether ‘alerts’ must
be noticed and responded to outside of regular working hours. In
other words, whether or not alerts can wake people up in the middle
of the night. If alerts can, then you want to be very sure that
these alerts genuinely matter; you will, for example, have good
reason to only alert on problems in ‘user journeys’ and defer notifications about anything
else to regular working hours in some way or another.

(Well, you don’t have to, but if you insist on waking people up in
the middle of the night for everything, pretty soon you won’t have
very many people left to wake up. Especially good people.)

The second difference is how much alerts interrupt people's work.
For example, there may be a policy that all alerts must be
acknowledged and investigated promptly, even if they’re ‘during
regular working hours’ alerts. This pushes people to make alerts
visible and interrupting, and requires people to drop what they're
doing to investigate them. Again, this drives sensible places to
make sure that ‘alerts’ really matter and to err on the side of
doing more work in alert setup to be sure of that.

(Sometimes people have different sorts of ‘alerts’ where only some
sorts (eg, ‘P1 alerts’) require 24/7 response or immediate action.)
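
To make these two differences concrete, here is a minimal sketch of
what such a policy might look like if you wrote it down as code. All
of it is made up for illustration: the ‘P1’ label, the 9 to 5 weekday
working hours, and the ‘page’, ‘email’, and ‘hold’ outcomes are
assumptions, not how any particular alerting system works.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Hypothetical working hours; real schedules vary by organization.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


@dataclass
class Alert:
    name: str
    priority: str  # hypothetical labels: "P1" is urgent, anything else is not


def in_working_hours(when: datetime) -> bool:
    # weekday() is 0 for Monday; 5 and 6 are Saturday and Sunday.
    return when.weekday() < 5 and WORK_START <= when.time() < WORK_END


def route(alert: Alert, when: datetime) -> str:
    """Return "page", "email", or "hold" for an alert raised at `when`."""
    if alert.priority == "P1":
        return "page"   # may wake someone up at any hour
    if in_working_hours(when):
        return "email"  # something to look at, not an interruption
    return "hold"       # defer until regular working hours
```

An organization where every alert interrupts people is one where
almost everything takes the first branch; one where alerts are mostly
‘here is something to look at’ rarely or never takes it.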

We are a relatively extreme
version of the other side of both of these differences. For us,
alerts in general are not ‘you must pay attention to this 24/7’ but
mostly ‘here is something you probably want to look at’. Of course sometimes the thing we
probably want to look at is ‘something exploded’ and we’re going
to jump on that, but there’s no requirement that we look into all
alerts immediately. Our alerts are designed to be quiet, but that
isn’t because they page us in the middle of the night; it’s because
we want to keep our email volume down and avoid the alert fatigue
where we ignore alert emails and so miss more important issues.

My sense is that most places today have ‘alerts’ that at least
sometimes are on the ‘wake people up and/or interrupt their work’
end of things, so that when you talk about ‘alerts’ in general,
this is what most people assume. I don’t think we have a good, well
understood term for the less intense sort of alerts. In the past
I’ve called them ‘notifications’,
but I suspect that people wouldn’t necessarily understand what I
meant by that if, for example, I talked about ‘how we use Prometheus
to drive our notifications’ (instead of ‘alerts’). There’s also the
issue that a lot of our technology for this specifically talks about
‘alerts’ and things like ‘alert rules’. It’s hard to write about this
in a clear way without using ‘alert’ and ‘alerts’.
