Average Precision (AP) is an evaluation metric for ranking systems that’s often recommended for use with imbalanced binary classification problems, especially when the classification threshold (i.e. the minimum score to be considered a positive) is variable, or not yet known. When you use AP for classification you’re essentially trying to figure out whether a classifier is any good by looking at how the model’s scores rank positive and negative items in the test set.
Typical scenarios where AP might come in handy include:
- You’re classifying but the threshold is going to vary depending on external variables, so you can’t use normal thresholded metrics.
- The classifier model is to be used both as a classifier and a ranker in different applications.
- You’re building a pure classifier, but you’re interested in understanding how your classifier ranks results by its scores, because it could help you better understand where it’s going wrong, etc.
There’s one important property of AP which in my experience is rarely mentioned but incredibly important when used for classification problems:
Average Precision depends heavily on positive rate (i.e. class priors)
To some people this might seem obvious. Precision is famously skewed by positive rate, so surely Average Precision must be too! Well, that’s not a great argument, although it reaches the right conclusion.
When I ran into this for the first time, I actually managed to convince myself of the opposite: that AP was independent of the positive rate. That’s wrong. This is how I tricked myself: “AP is the area under the precision-recall curve. If positive rate is high, then it will be easy to have high precision, but it will be hard to catch all the positive items (i.e. have high recall). If positive rate is low, then precision will be hard to get, but recall will be easier. Probably the changes in precision and recall cancel each other out to some degree.”
It took me a couple of minutes to realize the mistake in the above reasoning.1 Sloppy half-reasoning like that is often useful but dangerous. It’s important to always double check by either consulting the literature or trying to find a more rigorous argument or proof.
In this case, the problem was that recall does not vary with positive rate, contrary to what I implied. Of course, if the positive rate is higher, there are going to be more items that you won’t catch, but the fraction of positive items you do catch is expected to stay the same.
Recall is independent of positive rate
Let’s visualize this with an example. In the following image we’ll position blue and orange dots on a line. Each dot represents an item in the test set. The farther to the right the item is, the higher the predicted score for that item is. Blue dots are positive examples, while orange dots are negative examples. Let’s look at what happens with precision (P) and recall (R).
Now we’ll look at the same exact example, but we’ll undersample from the positive class. Note that the model giving the scores is exactly the same, all that’s changing is that we’re getting rid of some of the positive examples in the test set.
Notice how when we undersample from the positive class, precision can only go down or stay the same: removing a positive above the threshold lowers it, and removing one below the threshold leaves it unchanged. There is no positive example we can eliminate that will make precision go up. Recall, on the other hand, may go up or down depending on which side of the threshold we eliminate items from, but the expected result is that it will stay the same. More on that later.
That example illustrates how when we reduce the positive rate in a test set, precision is expected to go down and recall is expected to stay the same. Of course, a single example with a small dataset doesn’t prove anything, but hopefully it helps visualize why precision and recall behave differently when we change the positive rate.
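To get a feel for this beyond a handful of dots, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the scores are synthetic (positives drawn from N(1, 1), negatives from N(0, 1)), the threshold is fixed at 0.5, and the helper names are made up for this example.

```python
import random


def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    n_pos = sum(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / n_pos if n_pos else 0.0
    return precision, recall


def make_test_set(n_pos, n_neg, rng):
    """Synthetic test set (assumption): positives ~ N(1, 1), negatives ~ N(0, 1)."""
    scores = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    labels = [1] * n_pos + [0] * n_neg
    return scores, labels


def undersample_positives(scores, labels, keep_fraction, rng):
    """Randomly drop positives, keeping every negative untouched."""
    kept = [(s, y) for s, y in zip(scores, labels)
            if y == 0 or rng.random() < keep_fraction]
    return [s for s, _ in kept], [y for _, y in kept]


rng = random.Random(0)
scores, labels = make_test_set(20_000, 20_000, rng)
p_full, r_full = precision_recall(scores, labels, threshold=0.5)

# Same "model" (same score distributions), but only ~10% of the positives are kept.
sub_scores, sub_labels = undersample_positives(scores, labels, 0.1, rng)
p_sub, r_sub = precision_recall(sub_scores, sub_labels, threshold=0.5)

print(f"full test set:          precision={p_full:.3f} recall={r_full:.3f}")
print(f"positives undersampled: precision={p_sub:.3f} recall={r_sub:.3f}")
```

Under these assumptions recall should land near 0.69 in both runs, while precision should fall from roughly 0.69 to roughly 0.18. The exact numbers depend on the random draw.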
Another way to visualize how positive rate won’t affect your recall is to imagine having a dataset and then undersampling the negative class. When you undersample the negative class positive rate will go up, but recall won’t change, as the formula for recall (TruePositives / AllPositives) simply ignores all negative examples. In other words, recall depends only on the distribution of scores for items of the positive class, while precision depends on both the distribution of scores of the positive and the negative class, and also on the class priors.2
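The same point can be written down in a couple of lines. The notation here is my own and not part of the original argument: $\pi$ is the positive rate, $s$ is the score, $t$ is the threshold, and $\mathrm{TPR}$ and $\mathrm{FPR}$ are the true and false positive rates at that threshold. In expectation, for a large test set:

$$
\mathrm{Recall}(t) = P(s \geq t \mid y = 1) = \mathrm{TPR}(t)
$$

$$
\mathrm{Precision}(t) = \frac{\pi \, \mathrm{TPR}(t)}{\pi \, \mathrm{TPR}(t) + (1 - \pi) \, \mathrm{FPR}(t)}
$$

Recall has no $\pi$ in it at all. Precision does, and for fixed $\mathrm{TPR}(t) > 0$ and $\mathrm{FPR}(t) > 0$ it shrinks as $\pi$ shrinks.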
How this applies to AP
What we’re seeing then is that for a particular threshold, lowering the positive rate is expected to lower precision and maintain recall.3 In consequence, all values of the precision-recall curve are expected to go down, and therefore AP, being the area under the precision-recall curve, is expected to go down as well.4
When could this become a problem? There are two big cases.
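Here is a rough sketch of that effect on AP itself. It assumes the same kind of synthetic scores as before (positives from N(1, 1), negatives from N(0, 1)), and it computes AP as the mean of precision-at-rank over the ranks where a positive item appears, which is one common way of defining it. `simulated_ap` is a made-up helper for this illustration.

```python
import random


def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive item appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def simulated_ap(positive_rate, n_items=20_000, seed=0):
    """AP of a fixed synthetic 'model' on a test set with the given positive rate."""
    rng = random.Random(seed)
    n_pos = round(n_items * positive_rate)
    n_neg = n_items - n_pos
    # Assumption: class-conditional score distributions never change,
    # only the mix of positives and negatives does.
    scores = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    labels = [1] * n_pos + [0] * n_neg
    return average_precision(scores, labels)


for rate in (0.5, 0.2, 0.05, 0.01):
    print(f"positive rate {rate:.2f}: AP = {simulated_ap(rate):.3f}")
```

The score distributions are identical in every run, so the "model" is equally good each time. Only the positive rate changes, and AP should drop along with it.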
- Comparing the performance of models tested with different data.
- Comparing the performance of a model in different subsets of a test set.
Suppose the following case: your classifier is retrained periodically with new data, and you want to know whether the model got better or worse. It’s dangerous to compare APs for different test sets because if the positive rate changed from one to the other, then you may have a big swing in AP that’s not really related to how good the model is!
One option to mitigate this could be to test all models with the same dataset, or to create carefully crafted stratified test sets that keep these properties constant artificially.5 Another option is to use AUC-ROC or other metrics that don’t have this problem instead of AP. But obviously, they come with their own caveats.
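As a quick sanity check on the AUC-ROC suggestion, here is a sketch that computes AUC-ROC with the rank-sum (Mann-Whitney U) identity for the same synthetic model at two very different positive rates. The score distributions and the helper `simulated_auc` are assumptions made for this example, and the implementation assumes there are no tied scores.

```python
import random


def roc_auc(scores, labels):
    """ROC AUC via the rank-sum identity; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of the 1-based ascending ranks held by positive items.
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulated_auc(positive_rate, n_items=50_000, seed=0):
    """AUC of a fixed synthetic 'model' on a test set with the given positive rate."""
    rng = random.Random(seed)
    n_pos = round(n_items * positive_rate)
    n_neg = n_items - n_pos
    # Assumption: positives ~ N(1, 1), negatives ~ N(0, 1), regardless of the mix.
    scores = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    labels = [1] * n_pos + [0] * n_neg
    return roc_auc(scores, labels)


for rate in (0.5, 0.02):
    print(f"positive rate {rate:.2f}: AUC-ROC = {simulated_auc(rate):.3f}")
```

With these distributions both runs should come out close to 0.76, up to sampling noise, even though the positive rate differs by a factor of 25.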
The other case is when you want to check for which types of items your model performs well or badly. If you simply compute AP for different subsets of your test set, you need to be very careful that the positive rates in those different subsets stay the same! If not, AP might give you a completely wrong idea.
To make it easier to visualize: if the positive rate is 50%, a completely random model is expected to have an AP of about 0.5. On the other hand, if the positive rate is 2%, the AP of a random model is expected to be about 0.02. Getting an AP of 0.5 in that case might be incredibly hard.6
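The random baseline is easy to check empirically. This is a small sketch, not a proof: it gives every item a uniformly random score that ignores the label, and computes AP as the mean of precision-at-rank over the positive items. `random_model_ap` is a hypothetical helper written for this illustration.

```python
import random


def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive item appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def random_model_ap(positive_rate, n_items=50_000, seed=0):
    """AP of a 'model' whose scores carry no information about the labels."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < positive_rate else 0 for _ in range(n_items)]
    scores = [rng.random() for _ in labels]  # scores are independent of labels
    return average_precision(scores, labels)


for rate in (0.5, 0.02):
    print(f"positive rate {rate:.2f}: random-model AP = {random_model_ap(rate):.3f}")
```

For a large test set the printed AP should sit close to the positive rate in each case, so an AP of 0.5 is the floor in one setting and a big achievement in the other.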
Next time you’re working with AP in a classification context, be wary of the effect positive rate has on it.
- Apart from the obvious fact that having two things pulling in opposite directions is not enough to say they cancel each other out.
- With threshold $t$ and the scores of the positive class being sampled from a distribution $D$, the expected recall for a large enough test set will be $P_{x \sim D}(x \geq t)$. That’s invariant to class priors, and to the underlying distribution of positive class items in the test set.
- “Lowering the positive rate” is reducing the ratio of the test set that is of positive class. It’s assumed that the underlying class-conditional distributions stay the same. Crucially, though, if in real life you see your positive rate change significantly, this will likely be accompanied by a change in class conditional distributions as well! Analysing such cases is complex. Understanding that AP reflects positive rate is just the first step in avoiding pitfalls when you have varying positive rates in test sets.
- There’s an important interaction that I decided to leave out of this blogpost for simplicity: if by undersampling we cause the minimum score for positive class items to change (e.g. if by undersampling the positive class, we get rid of the positive items with the lowest score) then AP could go up even if precision at all thresholds goes down! That’s because if the minimum score for ground-truth-positive items goes up, low thresholds (which usually have bad precision) will stop being taken into consideration for computing AP. All of this is not too important because for big-enough, dense-enough, natural datasets this effect should be much smaller than the one introduced by the change in precisions at each threshold. Nevertheless, I found it interesting.
- Of course, doing this makes evaluation metrics harder to interpret, as they no longer mean simple things like “the probability of a classified-positive to be ground-truth-positive”. Disentangling the meaning of an evaluation metric in a stratified test set requires some work. Also, doing this might penalize you too much for poor performance in very rare items.
- Interpreting some value of some evaluation metric to be “good” or “bad”, “hard” or “easy” depends on a lot of different factors. Here, “hard” is just intended to mean “much better than random guessing”.