SLM: Issues with traditional SLM
We introduced some basic aspects of SLM in the previous posts. We now turn to a list of issues with traditional Service Level Management.
Quantity instead of Quality
In general, it is easier to define a metric or KPI based on a quantitative parameter: the number of incidents, the number of escalations, the number of errors, minutes of downtime, etc. That is why this type of parameter is so often found in SLAs. In many cases, however, quantity is not related to quality.
‘The number of incidents resolved’, for instance, is not a good quality parameter. It may well be that some of those incidents took years to even be acknowledged. ‘The number of incidents resolved within an agreed time frame’ is better, but what if the resolution is only a temporary workaround? Or what if, in order to attain the required time frame, staffing is doubled and costs skyrocket?
We could go on giving examples like this. Sometimes quantitative information is useful, or can be rephrased to be meaningful; sometimes quantitative parameters can serve as background information. But avoid drawing up an SLA on the basis of purely quantitative parameters.
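To make this concrete, here is a minimal Python sketch with invented incident data (the figures, the 4-hour SLA and the ‘permanent fix’ flag are all hypothetical): the purely quantitative KPI looks healthy, while most of the resolutions behind it are mere workarounds.

```python
# Hypothetical incident data; the 4-hour SLA is an assumption.
incidents = [
    # (resolution time in hours, permanent fix?)
    (2.0, True),
    (3.5, False),   # quick workaround, root cause still open
    (1.0, False),   # another workaround
    (30.0, True),   # real fix, but late
]

AGREED_HOURS = 4.0

within_sla = sum(1 for hours, _ in incidents if hours <= AGREED_HOURS)
kpi = within_sla / len(incidents)

permanent_within_sla = sum(
    1 for hours, permanent in incidents
    if hours <= AGREED_HOURS and permanent
)

print(f"Quantitative KPI: {kpi:.0%} resolved within the agreed time")            # 75%
print(f"Permanent fixes within it: {permanent_within_sla / len(incidents):.0%}")  # 25%
```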
There is another way to handle quantitative KPIs, but we leave this for a later post.
We continue with the quantitative versus qualitative discussion.
Lack of representativeness
The reason for defining metrics and KPIs in the first place is to measure the quality of a service. Quality is an abstract and broad notion, with a subjective connotation that is hard to put into words. In most cases, a KPI represents one aspect of the service, not the quality of the service as a whole. Performance-related metrics that deal with speed or turnaround, for instance, are often used as KPIs (see examples 1, 2, 4, 5).
Performance (typically measured by quantitative metrics, see here) is only part of the quality that is expected. Consider the following two examples:
A ticket may be solved within the requested time, but via a temporary solution. Overall quality suffers, with an impact on customer satisfaction.
The time on hold at a service desk is too high by company standards, but most cases are resolved during that one call, meaning that the effort pays off and customers are positive about the service.
In other words, the lack of representativeness can go both ways: actual quality may be higher or lower than the performance KPI suggests.
Another example: it is true that, on average, a customer who has to wait a long time before getting someone on the line will likely be less happy than one who does not have to wait at all. It is true that an IT issue resolved in one hour will be more appreciated than one that takes two days. But living in the illusion that fixing all IT incidents within an hour is the way to a perfect IT service desk is doomed to fail.
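A small sketch with invented numbers, mirroring the two examples above (the 4-hour SLA, the 60-second hold limit and the satisfaction proxies are all assumptions): the performance KPI and the customer’s actual experience can diverge in both directions.

```python
# Invented numbers; SLA values and satisfaction proxies are assumptions.

# Example 1: tickets closed fast (good KPI), but often via a workaround.
tickets = [
    {"hours": 1.0, "workaround": True},
    {"hours": 2.0, "workaround": True},
    {"hours": 1.5, "workaround": False},
]
kpi_speed = sum(t["hours"] <= 4.0 for t in tickets) / len(tickets)
truly_resolved = sum(not t["workaround"] for t in tickets) / len(tickets)

# Example 2: long time on hold (bad KPI), but first-call resolution.
calls = [
    {"hold_s": 240, "first_call_fix": True},
    {"hold_s": 300, "first_call_fix": True},
    {"hold_s": 200, "first_call_fix": False},
]
kpi_hold = sum(c["hold_s"] <= 60 for c in calls) / len(calls)
happy = sum(c["first_call_fix"] for c in calls) / len(calls)

print(f"Tickets: {kpi_speed:.0%} within SLA, only {truly_resolved:.0%} truly fixed")
print(f"Calls:   {kpi_hold:.0%} answered in time, yet {happy:.0%} helped in one call")
```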
The wrong way around
The practice of defining (mainly performance-based) KPIs is widespread, so widespread in fact that the reasoning is often reversed: instead of measuring a set of KPIs that may have a positive influence on quality, one often (implicitly) states that the quality of a service is defined as the outcome of these KPIs.
To give an extreme example: suppose the weather is proven to have an impact on the customer satisfaction of a call center. Is it rational to state that the weather defines quality? Should the Service Level be calculated from weather statistics? I don’t think many managers would buy this.
On the other hand, when Service Levels are defined, one often finds quality reduced to a single parameter, e.g. ‘waiting time on hold’ or ‘system uptime’.
Focus on the wrong instances
A consequence of dealing with averages and aggregate metrics is that, once a threshold value is crossed, there is no longer an incentive to act on the individual instance that went wrong. As an example, consider example 6 given before. If the resolution time of an incident is above the threshold, it is counted against the percentage, and that contribution will not get any better over time, but also not worse! In other words, there is no longer a reason to fix that incident as soon as possible. Rather, it may seem better to focus on other incidents that can still be fixed within the threshold.
The result may be a large number of open incidents and thus unhappy customers, and rightly so.
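A minimal sketch, with hypothetical figures and an assumed 8-hour threshold, of why a breached incident loses priority: in a ‘percentage within threshold’ KPI, closing the late incident today or next month yields exactly the same score.

```python
# Sketch of a 'percentage resolved within threshold' KPI.
# Figures and the 8-hour threshold are hypothetical.

THRESHOLD_HOURS = 8.0

def kpi(resolution_hours):
    """Fraction of incidents resolved within the threshold."""
    within = sum(1 for h in resolution_hours if h <= THRESHOLD_HOURS)
    return within / len(resolution_hours)

# One incident is already far above the threshold (120 hours open).
close_the_old_one_today = [2, 5, 7, 120]
leave_it_open_a_month = [2, 5, 7, 840]

print(kpi(close_the_old_one_today))   # 0.75
print(kpi(leave_it_open_a_month))     # 0.75 -- identical, so why hurry?
```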
Gaming the system
We start by considering metric 1 in the examples given before, which is related to the on-hold time of a call. There are several ways to avoid the penalty of crossing the threshold for this metric:
One such way is to drop the call (a technical error is always possible, after all).
Another way is to have an automated answering machine ask the user some (possibly irrelevant) questions.
A third option is to answer the call but forward it to a different team as soon as the user starts explaining the question or problem.
In none of these cases can the service be called good, yet the corresponding metric looks perfect. Other metrics can be gamed in similar ways.
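To illustrate with invented call data (the 60-second limit is an assumption): if the on-hold KPI is computed over answered calls only, dropping every call that is about to cross the limit produces a ‘perfect’ score while the caller’s experience gets worse.

```python
# Invented call data; the 60-second hold limit is an assumption.
HOLD_LIMIT_SECONDS = 60

hold_times = [30, 45, 50, 200, 400]   # two callers wait far too long

# Honest KPI over all calls.
honest = sum(1 for s in hold_times if s <= HOLD_LIMIT_SECONDS) / len(hold_times)

# Gamed: 'technical errors' drop every call before it crosses the limit,
# and the KPI is then computed over the answered calls only.
answered = [s for s in hold_times if s <= HOLD_LIMIT_SECONDS]
gamed = sum(1 for s in answered if s <= HOLD_LIMIT_SECONDS) / len(answered)

print(f"Honest KPI: {honest:.0%} of calls answered within {HOLD_LIMIT_SECONDS}s")
print(f"Gamed KPI:  {gamed:.0%} -- a perfect score")
```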
To make things worse: since usually only a small number of KPIs is selected for monitoring service quality, they are easy to keep an eye on and follow up. It is thus possible to manage most of the metrics in such a way that success is guaranteed.
Combine this effect with the earlier issues and one ends up with a nice collection of, say, five KPIs and the conclusion that service quality is optimal.
Avoiding accountability
When all goes wrong and the KPIs show bad performance, there is one last option: avoid accountability. For instance, one can claim that other teams have not done their job, causing the resolution time to exceed the threshold. This can quickly lead to long discussions and, eventually, mistrust between teams and companies.