Water and electricity are two components without which a modern home can’t function well. Both are provided as a utility, and both have strictly defined access points from which they can be consumed - taps for water and outlets for electricity.
But there are also differences. Every child knows that an electric shock can cause injury even after a short exposure - hence most people perceive electricity as a powerful force. This force, however, comes with a binary control, in the form of switches, circuit breakers and distribution boards. Turn it on - electricity is flowing; turn it off - it’s not. When off, electricity can’t leak, by design.
Water, on the other hand, is not perceived as such a great force, because damage from a short exposure is unlikely to be severe. Additionally, indoor plumbing has no binary on-off switches - a tap is set by degree, somewhere between “open” and “closed”, “hot” and “cold.” As a result, leaks can and do occur from time to time. And it’s these leaks that have the potential to do costly damage over time, yet are not perceived as dangerous enough to warrant immediate attention.
Many things in software applications are binary in nature - a web server daemon is either up or down, for example. We all take these all-or-nothing components seriously, because when it’s nothing, the app is down.
But we have our fair share of potentially leaky stuff as well - memory leaks, file descriptor leaks, network connection leaks, and so on. In other words, problems that don’t happen instantaneously but build up over time, often hidden behind some other, bigger component. Some of us don’t take these issues seriously enough, because they don’t seem capable of causing significant damage quickly. That’s a mistake.
When monitoring a component of the “electricity” type, the most common test is to send a probe - if it returns OK, the component is up (this is known as “active monitoring”, “active polling” or simply “polling”). But this doesn’t work for a component of the “plumbing” type - the fact that water is flowing doesn’t mean there is no leak. In this case, a set of alarms instrumented into the component itself is a better fit.
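To make the contrast concrete, here is a minimal Python sketch of both approaches. Everything in it is an assumption made for illustration: the `probe` function, the `LeakAlarm` class, the "strictly rising over a window" rule and the thresholds are hypothetical, not taken from any particular monitoring tool. The probe only answers "is it up?", while the alarm lives inside the component and watches a gauge (say, the number of open file descriptors) for steady growth.

```python
import socket
from dataclasses import dataclass, field


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Active poll ("electricity" check): True if a TCP connection opens."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class LeakAlarm:
    """Instrumented alarm ("plumbing" check) for a gauge that should stay flat.

    Hypothetical rule: fire when the last `window` samples are strictly
    rising and the total growth across them exceeds `min_growth`.
    """
    name: str
    window: int = 5          # how many recent samples to examine
    min_growth: float = 0.0  # growth over the window that counts as a leak
    samples: list = field(default_factory=list)

    def record(self, value: float) -> bool:
        """Store one sample; return True if the alarm should fire."""
        self.samples.append(value)
        self.samples = self.samples[-self.window:]
        if len(self.samples) < self.window:
            return False  # not enough history yet
        rising = all(b > a for a, b in zip(self.samples, self.samples[1:]))
        growth = self.samples[-1] - self.samples[0]
        return rising and growth > self.min_growth


if __name__ == "__main__":
    # The component would call record() itself, e.g. once a minute,
    # passing its current count of open file descriptors.
    alarm = LeakAlarm("open_fds", window=4, min_growth=10)
    for count in [100, 105, 110, 116]:
        if alarm.record(count):
            print(f"{alarm.name}: possible leak, now at {count}")
```

Note that a probe against this component would keep returning OK the whole time the descriptor count climbs, which is exactly why the two checks complement rather than replace each other.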
The sooner we recognize that the components of our applications differ in nature and need to be monitored differently, the higher the uptime we are going to achieve.