You are developing a Machine Learning model for an application in the industry. How do you measure the model’s performance? How do you know whether the model is doing its job correctly? In my experience, there are two common approaches to this.
- Use a normal Data Science metric like F1-score, average precision, accuracy or whatever suits the use-case.
- Target an existing business metric like click-through rate for recommender systems, cost reductions for automation of existing systems, profit and/or sell-through rate for pricing systems, etc.
In the rest of this blogpost, I’ll talk a bit about how we can and should measure the success of our Machine Learning models, keeping in mind that the final objective is to answer the question “Is this thing worth putting in production?”
Data science metrics
I’ve talked in the past about unexpected difficulties when dealing with the Data Science-y metrics we all know and love like average precision, F1, recall, etc. One problem I haven’t talked about is how these metrics are hard to translate to a decision on whether the model is worth using: they only really tell you whether a model is in some sense hitting its mark.
To understand whether an already developed model is worth putting in production, you need to be able to weigh the gains of having the model running against the costs of keeping it running. For that purpose, how accurate your model is often doesn’t tell you much about what monetary gains you stand to get from having it in production.
Let’s put this into an example. Suppose you’re automating some system using a classification model and your model achieves a 98% accuracy, while the current human-driven system has an accuracy of 98.7%.
Is this model worth putting in production?
Obviously that’s not enough information to make the decision. You need to take into consideration how costly the system was before it was automated, and how costly it will be after automation.
Now, let’s say you know the current average cost introduced by misclassified items is $C$, and the savings of automating one item (i.e. manual cost minus automated cost) is $S$. The obvious math to do is to check whether $S > 0.007 \cdot C$, that is, whether the $S$ of average savings is or isn’t higher than the average cost increase from extra misclassified items (i.e. the extra misclassification rate, $0.987 - 0.98 = 0.007$, times the misclassification cost).
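To make that check concrete, here is a minimal Python sketch. The function name and the monetary amounts are made up for illustration; only the 98.7% and 98% accuracies come from the example above.

```python
def automation_break_even(savings_per_item, cost_per_error,
                          current_accuracy, model_accuracy):
    """Compare average savings per item with the expected extra error cost.

    All monetary inputs are hypothetical and must share one currency.
    """
    # Extra share of items the model gets wrong compared to the current system.
    extra_error_rate = current_accuracy - model_accuracy
    # Average added cost per item: extra error rate times the cost of one error.
    extra_cost_per_item = extra_error_rate * cost_per_error
    net_gain_per_item = savings_per_item - extra_cost_per_item
    return net_gain_per_item > 0, net_gain_per_item


# Illustrative numbers only: 0.50 saved per automated item, 40.00 per error.
worth_it, net_gain = automation_break_even(0.50, 40.0, 0.987, 0.98)
print(worth_it, round(net_gain, 2))
```

Raising the assumed cost per error to 100.00 flips the answer, which is a hint of how sensitive this naive check is to a number that is usually only roughly known.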
Do you now know whether the model is worth putting in production? Not really. Although many organizations would consider this to be enough information and would move forward with the automation process, there are actually many more crucial questions to ask.
What are the consequences of having this extra 0.7% of wrongly classified events/entities/items? Can the current infrastructure handle this increase in error rate, or do we need some form of investment to increase the capacity of some other part of the pipeline? Is this going to degrade the customer experience?
At an even more basic level: the current average cost of misclassified items is $C$, but our model might make different kinds of mistakes than the ones that are being made right now. Could the kinds of mistakes that the model makes be more costly than the mistakes usually made by the current system?
Often employees are trained to get the cases that really matter right, and fail on cases that are somewhat ambiguous or relatively unimportant, but that’s harder to do reliably with Machine Learning models. Also, cases that really matter might be infrequent. So you should not be caught by surprise if your model happens to perform poorly on examples that are considered particularly important by the business.
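A small sketch of why the mix of mistakes matters, assuming items can be split into a frequent "routine" segment and a rare "critical" one. The segments, shares, error rates and costs below are all invented for illustration.

```python
def overall_error_rate(segments):
    """Average error rate across segments, weighted by share of items."""
    return sum(share * error_rate for share, error_rate, _ in segments)


def expected_error_cost(segments):
    """Expected error cost per item, weighted by share of items."""
    return sum(share * error_rate * cost for share, error_rate, cost in segments)


# Each tuple: (share of items, error rate, cost per error). Hypothetical values.
# First tuple is the routine segment, second is the critical one.
current_system = [(0.95, 0.0135, 10.0), (0.05, 0.001, 500.0)]
model = [(0.95, 0.019, 10.0), (0.05, 0.039, 500.0)]

for name, segments in [("current", current_system), ("model", model)]:
    print(name,
          round(overall_error_rate(segments), 4),
          round(expected_error_cost(segments), 4))
```

With these invented numbers the model's overall error rate is only about one and a half times the current one, yet its expected error cost is several times higher, because more of its mistakes land on the expensive segment.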
How do you take all of these things into consideration when trying to measure whether your model’s evaluation metric is good enough?
Conversion to business value
Often, measuring how all of the aspects I mentioned above add up is very hard or, ironically, itself very costly to do right. That’s if it’s even possible. Let’s recap: what we’re essentially trying to do here is find a formula to convert your data science-y evaluation metrics into business metrics. So maybe we should talk about business metrics directly instead, which appear to be the real source of truth for determining whether a model is worth using.
Let’s repeat this just one more time, because it’s an often ignored aspect of Machine Learning Engineering: even if your model can adequately perform the task that it’s meant to perform, that doesn’t mean it’s a good idea to use that model.
If we’re going to discard them for the final decision-making process, what’s the value of data science metrics?
Well, there’s the obvious point that measuring F1 or MAE in your offline experiments is much easier than somehow estimating business value directly. So if you’re doing an early-stage PoC, getting your data science metrics is a good way to have a quick answer to “can this task be performed by a model?”, and that’s a great place to start, even if it doesn’t paint the full picture.
There’s also the point that the underlying Data Science metrics are often much easier to monitor. If you are monitoring for degradation in performance in your data science metrics you might be able to make the necessary changes to your models before they impact your business metrics significantly. If you only have a final business metric to look at, you’ll often find out about model degradation when it’s already too late.
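As a rough sketch of that kind of monitoring, the class below tracks accuracy over a sliding window of recent predictions and flags when it drops below a baseline. The class name, window size, baseline and tolerance are assumptions for illustration, and it presumes that true labels eventually become available for production predictions.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent `window_size` labeled predictions."""

    def __init__(self, window_size, baseline, tolerance):
        self.outcomes = deque(maxlen=window_size)  # True when prediction was right
        self.baseline = baseline    # accuracy measured at deployment time
        self.tolerance = tolerance  # drop we accept before raising a flag

    def update(self, prediction, label):
        self.outcomes.append(prediction == label)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        # Wait for a full window so a few early errors don't trigger a flag.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.baseline - self.tolerance


monitor = RollingAccuracyMonitor(window_size=500, baseline=0.98, tolerance=0.02)
# In production: call monitor.update(prediction, label) as labels arrive,
# and check monitor.is_degraded() on a schedule.
```

The same structure works for other metrics; accuracy is used here only because it can be updated one example at a time.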
There’s a final important point which is that data science metrics can often be more comparable between subsets or different time frames than business metrics.¹ So you might be better able to distinguish when your model is performing worse vs. you’re getting worse results due to external conditions.
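Comparability depends on which metric you pick, though. As a small illustration (the rates below are invented), precision follows from the true positive rate, the false positive rate and the share of positives via Bayes' rule, so it moves when the class balance shifts even if the model itself behaves exactly the same:

```python
def precision_at_prevalence(true_positive_rate, false_positive_rate, prevalence):
    """Precision implied by fixed per-class rates and a given share of positives."""
    true_positives = true_positive_rate * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Same hypothetical model (recall 0.90, false positive rate 0.05),
# evaluated in two periods with different shares of positive examples.
for prevalence in (0.50, 0.05):
    print(prevalence, round(precision_at_prevalence(0.90, 0.05, prevalence), 3))
```

Recall stays at 0.90 in both periods, while precision drops sharply in the low-prevalence one. A metric built from per-class rates is therefore easier to compare across time frames than one that mixes in the class balance.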
Let’s assume you’ve solved the issue of mapping your data science metric to effects on costs, not just current average costs, but holistic expected costs taking into consideration how the model itself changes things. It may look like you’re good to go now, but there are actually a couple of extra questions to answer before you can say confidently that your model is worth putting in production.
Will your model stay as good as it is? How could you prove that it would? Does it need Continuous Training?
If you don’t think it’s going to stay as good as it is now, or if you can’t prove it, you have to ask yourself: what would be the effect of this model getting suddenly worse?
Models that are meant to automate currently manual tasks have to be examined for this potential issue more than other types of models, because usually if you automate a task, that means you lose access to the “old”, non-automatic way of doing the task. You no longer have a readily available, well-trained team that can do the job efficiently. You may not even have the necessary infrastructure to do it anymore.
So what happens if your model starts performing worse than it should? Do you have a plan B? Can the business withstand a couple of weeks of poor performance in that task while your data scientists figure out how to get the model back? What’s the cost of rolling back to a manual solution?
For some models, it’s going to be fine even if the task they performed cannot be performed at all for some time. Either the task is not time sensitive, it’s not a crucial aspect of the business, or it’s something that wouldn’t be performed at all if it weren’t for the model.
For all other models, you should have some kind of contingency plan. You may have a secondary, simpler (e.g. rules-based) model that can do an okay job as a backup if your ML model starts failing.² For yet other models, it’s not too bad to fall back to a manual process, because the process doesn’t need that much training and doesn’t require much investment in infrastructure.
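A minimal sketch of the backup-model idea: a wrapper that uses the ML model while it is enabled and working, and a simple rule otherwise. The function names, the `amount` feature and the 1000 threshold are hypothetical; in practice the `primary_enabled` switch could be driven by whatever monitoring you have in place.

```python
def predict_with_fallback(features, primary_model, backup_rule, primary_enabled=True):
    """Return (prediction, source), preferring the primary model when usable."""
    if primary_enabled:
        try:
            return primary_model(features), "primary"
        except Exception:
            # The primary model failed on this input; fall through to the backup.
            pass
    return backup_rule(features), "backup"


def simple_backup_rule(features):
    # Hypothetical rule: send large orders to review, approve the rest.
    return "review" if features["amount"] > 1000 else "approve"


def broken_model(features):
    raise RuntimeError("model service unavailable")


print(predict_with_fallback({"amount": 50}, broken_model, simple_backup_rule))
```

Returning the source alongside the prediction makes it easy to log how often the backup is actually being used.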
I’m not a betting man, but...
An interesting gamble that can often pay off is to first put the model into production and then immediately start working on the contingency plan. The idea is that a model you have just finished working on is unlikely to degrade for some time, and even if it does degrade very early, it’s often early enough to roll back to whatever was there before the model. The biggest danger in that gamble is that it can lead to complacency: the contingency plan might get de-prioritized by stakeholders, leading to potential disaster down the line. You should evaluate how much you trust your organization to properly prioritize these types of tasks.³
Occasionally a full-on gamble without any kind of backup plan is still worth it. This can happen when immediate gains are so important that you’re willing to risk proportionally much higher costs down the line. This can often be the case in very early-stage startups. If we don’t get something out now, we won’t be able to convince enough investors to keep the company operating. So just put something in production and deal with the consequences as they come.
Finally, if you’re not entirely confident about your model’s worth, occasionally you can still do a partial rollout. Get the model into production for some cases, where you’re more confident the model will perform well, and keep the current system for the rest. You can keep working on making your model more robust, but in the meantime, you can start reaping some of the rewards.
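One possible way to implement a partial rollout, sketched under the assumption that the model returns a confidence score along with its label: let the model handle items above a confidence threshold and send the rest to the current system. The function names and the 0.95 threshold are hypothetical, and routing by business segment instead of by confidence would be an equally valid reading of "some cases".

```python
def route_item(item, predict_with_confidence, confidence_threshold=0.95):
    """Decide whether the model or the current system should handle an item."""
    label, confidence = predict_with_confidence(item)
    if confidence >= confidence_threshold:
        return {"handled_by": "model", "label": label}
    # Not confident enough: leave the decision to the existing process.
    return {"handled_by": "current_system", "label": None}


def toy_predictor(item):
    # Stand-in for a real model: pretends to be confident only on short texts.
    return ("spam", 0.99) if len(item) < 20 else ("spam", 0.60)


print(route_item("win a prize", toy_predictor))
print(route_item("a much longer and more ambiguous message", toy_predictor))
```

This only helps if the confidence scores are reasonably well calibrated, which is worth checking on held-out data before relying on the threshold.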
Machine Learning models are not yet sufficiently well-understood that we could confidently measure the expected results of putting them in production. But still, if you’re like me you probably could do a better job than you’re currently doing.
Part of the issue is that Machine Learning education often stops at the point where you get an evaluation metric that looks good. But as an industry practitioner you should be aware that you’re only half-way there at that point: you have to work on converting that metric you got into something that’s actionable on a business level. One interesting tool to have in hand here is causal inference, but that’s a topic for another day.
This is the only sure advice I can give: pay attention to what you’re measuring, and always remember your model performance will most likely degrade over time. Continuous Training, Model Monitoring, and Contingency Models are necessary for critical, long-term models. You shouldn’t get complacent, as the moment a model starts showing degradation is not a good moment to start implementing countermeasures. Plan for them in advance.
But at the end of the day, you’ll have to make the choice of what’s the right gamble in your situation. Waiting until you’ve implemented all your detailed long-term cost analysis, contingency models and model monitoring contraptions before going to production is almost certainly a sub-optimal strategy. Never implementing anything like that, and accumulating a never-ending stream of unmonitored models waiting to fail, is just as sub-optimal. There’s some happy middle there that you should strive for.
1. That’s only if you carefully choose your metrics for that purpose, though. If you go willy-nilly using whichever metric you found lying around, you might be really sensitive to external conditions. See for example my earlier blogpost about how average precision is sensitive to class priors. Always remember your prediction set’s statistics are usually not fixed over time.
2. It’s a good idea for the backup model to be as simple as possible. The idea is: your backup model should be extremely robust to things like data drift, and it should make reasonable, motivated predictions under all circumstances. Simple models tend to have those properties, often in exchange for an overall less impressive performance.
3. If you don’t trust your organization’s decision-making at all, then that’s a much more pressing issue than whatever I’m talking about here. I’d stop reading this blogpost and start reading something that can help you change that instead.