Fast optimization of classification thresholds

Binary classification problems (target/non-target) are often modeled as a pair (f, \theta) where f : \mathbb{R}^D \to [0, 1] is our model, which maps input vectors to scores, and \theta \in [0, 1] is our threshold, such that we predict x to be of target class iff f(x) \geq \theta . Otherwise, we predict it to be of non-target class.

Representing dataset, scores and threshold
Model predicts “target” class
Model predicts “non-target” class

The threshold \theta is usually set to 0.5 , but this needs not be the best choice. If f can be assumed to be well-calibrated1 –which usually isn’t the case, though– and the cost of a false positive and a false negative are known constants (call them C_{FP} and C_{FN}), then a theoretical cost-minimizing threshold can be computed with the formula \frac{C_{FP}}{C_{FP} + C_{FN}} . That means 0.5 is the optimal threshold under these conditions only when C_{FP} = C_{FN} . For imbalanced classification, usually that equality does not hold. And even in balanced settings, it’s more likely that it won’t.

Now, that formula won’t save us every time. We need something else. Sometimes our model is not well-calibrated, as even when we use calibration methods on it, we may not trust that the result is actually properly calibrated. In other cases the costs for false positives and false negatives may not be known or may not be constant. What’s the cost of saying a dog is a cat if you’re a photo-categorizing software? Or the cost of wrongly tagging a sports news article as politics? It’s not always easy to quantify those costs.

So, suppose we wanted to use some other strategy to select the classification threshold. We could try to find instead the threshold \theta that maximizes the F_1 score in our validation set, or some other evaluation metric. How would we go about doing that?

The rest of this blogpost deals with finding thresholds that maximize (or minimize) classification metrics in time \mathcal{O}(Nlog(N)) where N is the size of the validation set used for optimizing the threshold.

The algorithm itself is very simple, but analyzing it seriously can help paint a clearer picture on how changing thresholds affect our metrics, and about what’s going on when we choose to use thresholded metrics for our classification models.

The general idea of how you get to optimize these thresholds fast is used in a bunch applications in Data Science. Something very similar to what I’m describing is used to find splits when building CART trees by scikit-learn2 (the similarity to this problem is pretty obvious if you remember how CART works) and is also used when building the precision-recall curve and related metrics like average precision, also by scikit-learn.

Algorithm for general threshold optimization


Let’s start by visualizing examples of the functions we’re trying to optimize.

Remember we’re trying to find the best threshold so that the resulting target/no-target classification using that threshold maximizes some evaluation metric like accuracy, F1, etc.3

Here the blue line is the function itself. The x axis represents all possible thresholds, and the y axis has the values of the evaluation metric being considered.

We also added green circles and red crosses. They represent our output scores for items in our dataset. In these examples the dataset –one would usually use the validation set– has 5 elements of target class and 5 of non-target class.4 These dots and crosses are not really part of the function itself, but visualizing the dataset together with the function helps (me) understand what’s going on.

Summarizing: the blue line is the evaluation function as we change the threshold. Circles and crosses mark scores for elements in our dataset of target and non-target class respectively.

Note how the value of our metric only changes when we move the threshold “over” some score in the dataset, changing the TP or FP count (more on that later). This will be true of all metrics that can be defined as a function of TP, FP, TN and FN.5

Example function of thresholds to accuracy
Example function of thresholds to precision

Algorithm description

Looking at the shapes of the functions can be pretty discouraging. They don’t belong to any of the well-known function types that are easy to optimize. They’re clearly not linear or affine, they’re not convex, in fact, not even continuous, and in the entire domain the derivative is either undefined or 0, so all gradient-based methods are out.

But, they have a very lovely property which is that we only need to evaluate a finite number of points to know all values of the function. This can be seen in the example functions, and isn’t hard to prove more rigorously.

That property of only having to evaluate a finite number of thresholds takes us all the way from not even knowing how to find the best threshold to having an obvious quadratic time algorithm: just try every possible value and pick the best one! Shown here as python-ish pseudo-code (remember vectorization will come later).

def find_threshold_quad(y_proba, y_val, eval_func):
    """Find a threshold that maximizes eval_func."""
    evals = np.empty_like(y_proba)
    for i, threshold in enumerate(y_proba):
        # confusion_matrix runs in linear time
        # w.r.t dataset size.
        tp, fp, tn, fn = confusion_matrix(
            y_val, y_proba >= threshold
        # assume eval_func runs in constant time!
        evals[i] = eval_func(tp, fp, tn, fn)
    best_threshold_idx = np.argmax(evals)
    return y_proba[best_threshold_idx]

Something to note here: we’re saying that eval_func should run in constant time but scikit-learn style thresholded evaluation functions (like sklearn.metrics.precision_score) take linear time usually. How can that be?

That’s because scikit-learn style functions count the TPs, FPs, etc. as part of the evaluation function itself. For the purpose of this blogpost we are using evaluation functions that take the TP, FP, TN and FN as pre-computed values. We will assume the eval_funcs are constant time functions for the rest of this blogpost. Accuracy, precision, recall, F1, g-mean, and most other thresholded metrics have obvious constant time implementations as functions of TP, FP, TN and FN. For example, accuracy is just \frac{TP + TN}{TP + TN + FP + FN} , so it always involves 3 sums and a division.

We can improve on the above algorithm’s time complexity, though. The trick is the following: if our list of scores is ordered, then there is no need to fully recompute the TPs, FNs, etc. for every single threshold, they don’t really change that much when we move from smallest to largest (or largest to smallest).

Assume all scores are different from each other to simplify the argument and the pseudo-code. We’ll deal with the problem of repeated scores later when we vectorize our code.

For the smallest threshold, i.e. when we are using the smallest predicted score as the threshold, everything is classified as target class, as every score (including itself) is greater than or equal to the smallest one. That means TN and FN are 0, while TP and FP are the count of elements of ground-truth target class and non-target class respectively.

Minimum threshold

While we move to the right, for each actual we pass, we only need to update those numbers (TP, FP, etc.) by incrementing or decrementing them based on the class of the previous item. Each time we “pass over” a score, if that was of class target, we lose a TP. If it was of class non-target, we gain a TN.

Second threshold

This realization that we don’t need to recompute TP , etc. from scratch every time let’s us get rid of our linear-time confusion_matrix function, and replace it with constant-time modification of our tp, fp, tn and fn variables.

We don’t quite manage to drop time complexity to linear time over the entire optimization because we need to sort the scores to do this. After we sort them, the actual optimization itself is just linear time, though.

def find_threshold_loglinear(y_proba, y_val, eval_func):
    """Find a threshold that maximizes eval_func."""
    sort_idx = np.argsort(y_proba)
    y_proba, y_val = y_proba[sort_idx], y_val[sort_idx]
    evals = np.empty_like(y_proba)
    tp = y_val.sum()
    fp = (y_val == 0).sum()
    tn, fn = 0, 0
    for i, threshold in enumerate(y_proba):
        # assume eval_func runs in constant time.
        evals[i] = eval_func(tp, fp, tn, fn)
        # Note we modify _after_ evaluation.
        if y_val[i] == 1:
            tp -= 1
            fn += 1
            fp -= 1
            tn += 1
    best_threshold_idx = np.argmax(evals)
    return y_proba[best_threshold_idx]

That’s pretty much all there is to it. In the following section we’ll present a vectorized numpy implementation of this same algorithm, which also covers some edge-cases we ignored here.

In particular, think what happens when the best threshold is one that classifies everything as non-target. Is that case handled in the above pseudo-code? Also think about the case where a score is repeated in the dataset. Are we handling those cases correctly? How could that be fixed?

Vectorized numpy implementation

def find_threshold(y_val, y_proba, eval_func_vec):
    """Vectorized threshold optimization.
    This uses `np.cumsum` to vectorize the conditional
    for-loops we used in the naïve version.
    # We start by sorting the probas in descending
    # order. (That's what the `-` is for)
    sort_idx = np.argsort(-y_proba)
    y_proba, y_val = y_proba[sort_idx], y_val[sort_idx]
    # We need to add an np.inf as a possible proba to
    # handle the case where the best threshold is one
    # where all predictions are of _non-target_.
    y_proba_completed = np.concatenate(
        [[np.inf], y_proba]
    unique_idx = np.concatenate(
            # Due to the way np.diff works, we need
            # to add the final element by hand.
    # These 0s here are added for getting the right
    # value with `cumsum` for `np.inf`.
    y_val_completed = np.concatenate([[0], y_val])
    y_val_completed_reverse = np.concatenate(
        [[0], y_val == 0]
    tps = np.cumsum(y_val_completed)
    fns = np.sum(y_val_completed) - tps
    fps = np.cumsum(y_val_completed_reverse)
    tns = np.sum(y_val_completed_reverse) - fps
    # This eval_func_vec should compute the metrics
    # for several TP, FP, TN, FN configurations at once.
    values = eval_func_vec(
        tps[unique_idx], fps[unique_idx],
        tns[unique_idx], fns[unique_idx],
    # nanargmax for dealing with undefined metrics,
    # like precision above highest score.
    best_unique_idx = np.nanargmax(values)
    best_idx = unique_idx[best_unique_idx]
    return y_proba_completed[best_idx]

The last question to answer is: how do you implement an eval_func_vec? Due to the overloaded syntax used in numpy, it’s usually very simple. A couple of examlpes follow.

def accuracy_vec(tps, fps, tns, fns):
    return (tps + tns) / (tps + tns + fps + fns)
def geometric_mean_vec(tps, fps, tns, fns):
    sensitivities = tps / (tps + fns)
    specificities = tns / (tns + fps)
    return np.sqrt(sensitivities * specificities)

The vectorized find_threshold above could be used for finding splits in algorithms for building decision trees, or, with small modifications, it could be used for creating precision-recall curves and other metric-to-metric curves fast.

Similar cumsum-type optimizations can be applied to time series data when you’re counting items before some time, accumulated averages, etc. The idea behind this algorithm (and vectorization) is useful in a variety of situations in Data Science, and is worth keeping in mind.


  1. Well-calibrated intuitively means that if we take the set A := \{x: f(x) \approx p\} , then the ratio of items of actual target class in A should be close to p . In other words, when a model is well-calibrated, f(x) can be interpreted as the probability that x is of class target.
  2. Here’s a link to an example place where sklearn uses the idea for trees and here’s how they use it for computing precision-recall curves and metrics that depends on counting TPs and FPs at all thresholds. Both those pieces of code have a lot of surrounding type-conversion, edge-case-matching, optional-parameter-considering code, so looking at the implementation in this blogpost first may help separate the important parts of that from the noise.
  3. When I say “the best threshold” I actually mean “one of the thresholds that are the best”. In other words, a threshold such that no other threshold is strictly better.
  4. In case the example is misleading, I want to clarify that the datasets needn’t be balanced. It happens to be balanced in these examples, but I could have chosen an imbalanced dataset as well.
  5. Alternatively, in terms of TP, FP, positive count, and negative count. Or other things from which we can compute the confusion matrix.

Storing a list in an int

Python’s default ints, unlike in C, Rust or Go, are of arbitrary size.1,2 What that means is there’s no absolute maximum value your ints can store. They’ll grow as long as they fit in memory.

For example, you can open python3 and run the following.

>>> import math
>>> math.factorial(2020)
[number omitted]
>>> math.log2(math.factorial(2020))
>>> type(math.factorial(2020))
<class 'int'>

So a normal, every-day int in python can easily store a value that would take up 19273 bits in a C-style fixed-size unsigned int type. In a language like python where convenience is valued over speed and memory efficiency, that’s really useful.

This unlimited precision also means we can store an arbitrary amount of information in a single int. With the right coding, an entire book, an entire database, or anything else fits in a single python int.

We can therefore imagine a dialect of python where we only have ints, and we need to represent everything else (dicts, lists, etc.) just by using ints. We’d also have special functions and methods that would treat ints as if they were lists, dicts, etc. This can be a fun exercise, and that’s what we’ll be doing in this post.

There’s an obvious way to do this: all data structures are just bit-arrays in memory. Worst-case scenario, it’s a set of related bit-arrays (like each node in a linked list or tree, for example), and sets are just bit arrays as well. Bit arrays can be interpreted as binary numbers. So we always have this option. But that’s a bit boring.

For this post, and for the following posts in this series I’ll present some ways of representing complex data structures as ints that I think are interesting. They are not necessarily the most compact, the most reasonable, or the most efficient. The common objective is to find interesting or fun representations of these data structures.3

Introducing Gödel numbering

The first data structure we’ll work on representing is list. What we’ll use is called Gödel numbering, named after the logician Kurt Gödel. We’ll be dealing with lists of unsigned integers (i.e. naturals) only, for convenience.

Gödel numbering works by abusing the fact that each natural number greater than 1 is uniquely represented by its prime factorization. This is given by the fundamental theorem of arithmetic.

Looking at some examples.

  • 126 = 2^1 \cdot 3^2 \cdot 5^0 \cdot 7^1.
  • 256 = 2^8.
  • 17891 = 2^0 \cdot 3^0\cdot 5^0 \cdot [\ldots] \cdot 17891^1.

A number is uniquely identifiable by the list of the exponents of its prime factors (up to its highest non-zero exponent). So we can say 126 represents the list [1, 2, 0, 1]. The first place in the list is the exponent of 2 in the prime decomposition of 126, the second value is the exponent of 3, and so on.

A couple of examples:

  • 2 = 2^1 \to [1]
  • 143 = 2^0 \cdot 3^0 \cdot 5^0 \cdot 7^0 \cdot 11^1 \cdot 13^1 \to [0, 0, 0, 0, 1, 1]
  • 9 = 2^0 \cdot 3^2 \to [0, 2]

What happens if you have a 0 at the end of the list? Well, with this encoding you can’t. There’s an infinite stream of primes with exponent 0 in our factorization, so we need to stop somewhere.4 We chose to stop at the last non-zero entry.

This representation uses very large numbers when the list contains somewhat large numbers. That’s because the numbers in our list become exponents, so the size of the int grows exponentially with them. For example, [50, 1000, 250] uses a number with size 2266 bits. On the other hand, relatively long lists of very small ints, and especially large sparse lists (i.e. where most of the values are 0) have very compact representations compared to other int-encodings of lists.

Keep in mind this is not a practical encoding of lists as ints for programming. It’s meant to be a fun experiment only.

Python implementation

Let’s look at an implementation in python. A couple of notes here:

  1. We’ll allow ourselves to use functions with yield because it simplifies things a great deal.5
  2. You’ll notice an excessive use of while loops. That’s because list comprehensions, range, and most things you’d consider using in a for loop are banned from our int-only dialect. All of those end up being replaced with while loops here.

Prime generation

The first function we’ll write is an iterator that yields primes in order. This will be useful throughout. This implementation is the simplest implementation that works. I might write an entire post in the near future about algorithms for generating primes, because it’s such a cool topic, and a venerable field of research in itself. The algorithm most people know is the Sieve of Erathosthenes, but that’s just the tip of the iceberg.6

For today, a very naïve implementation will do.

def primes(starting: int = 2):
    """Yield the primes in order.
        starting: sets the minimum number to consider.
    Note: `starting` can be used to get all prime numbers
    _larger_ than some number. By default it doesn't skip
    any candidate primes.
    candidate_prime = starting
    while True:
        candidate_factor = 2
        is_prime = True
        # We'll try all the numbers between 2 and
        # candidate_prime / 2. If any of them divide
        # our candidate_prime, then it's not a prime!
        while candidate_factor <= candidate_prime // 2:
            if candidate_prime % candidate_factor == 0:
                is_prime = False
            candidate_factor += 1
        if is_prime:
            yield candidate_prime
        candidate_prime += 1

Create empty list

def empty_list() -> int:
    """Create a new empty list."""
    # 1 is the empty list. It isn't divisible by any prime.
    return 1

Yield elements

def iter_list(l: int):
    """Yields elements in the list, from first to last."""
    # We go through each prime in order. The next value of
    # the list is equal to the number of times the list is
    # divisible by the prime.
    for p in primes():
        # We decided we will have no trailing 0s, so when
        # the list is 1, it's over.
        if l <= 1:
        # Count the number of divisions until the list is
        # not divisible by the prime number.
        num_divisions = 0
        while l % p == 0:
            num_divisions += 1
            l = l // p  # could be / as well
        yield num_divisions

Access element

def access(l: int, i: int) -> int:
    """Return i-th element of l."""
    # First we iterate over all primes until we get to the
    # ith prime.
    j = 0
    for p in primes():
        if j == i:
            ith_prime = p
        j += 1
    # Now we divide the list by the ith-prime until we
    # cant divide it no more.
    num_divisions = 0
    while l % ith_prime == 0:
        num_divisions += 1
        l = l // ith_prime
    return num_divisions


def append(l: int, elem: int) -> int:
    # The first step is finding the largest prime factor.
    # We look at all primes until l.
    # The next prime after the last prime factor is going
    # to be the base we need to use to append.
    # E.g. if the list if 18 -> 2**1 * 3**2 -> [1, 2]
    # then the largest prime factor is 3, and we will
    # multiply by the _next_ prime factor to some power to
    # append to the list.
    last_prime_factor = 1  # Just a placeholder
    for p in primes():
        if p > l:
        if l % p == 0:
            last_prime_factor = p
    # Now get the _next_ prime after the last in the list.
    for p in primes(starting=last_prime_factor + 1):
        next_prime = p
    # Now finally we append an item by multiplying the list
    # by the next prime to the `elem` power.
    return l * next_prime ** elem

Trying the functions out

You can open a python, ipython or bpython session and try these functions out!

You should use numbers between 1 and 10 for the list elements. If you use larger numbers append and access can take a very long time. In part, this is an impractical fact about using Gödel numbering for lists, although the range of viable element values can be improved a great deal by optimizing the prime generation and factorization algorithms.

In [16]: l = empty_list()

In [17]: l = append(l, 2)

In [18]: l = append(l, 5)

In [19]: list(iter_list(l))
Out[19]: [2, 5]

In [20]: access(l, 0)
Out[20]: 2

In [21]: access(l, 1)
Out[21]: 5

In [22]: l
Out[22]: 972

Other int encodings

We saw one way in which we can represent lists of natural numbers as ints. There’s other, more practical ways that rely on subdividing the binary representation of the number into variably-sized chunks. I’m sure you can come up with how that would look.

In the future I might write other posts about better algorithms for generating primes and factorizing numbers, and also about int-representations of other complex data structures.


  1. I would assume the implementation breaks up at some point even before you run at out memory, but the documentation does explicitly mention they have unlimited precision.
  2. Note this is true of python3 but not of python2. In the case of python2, ints are fixed-size. I think it’s safe to just say python and mean python3 in 2020, but I also think this detail is worth a footnote.
  3. For the case of Gödel numbering for lists, it’s actually easy to argue that it’s a particularly bad representation. We’ll talk a bit about the trade-offs in the representation later in the blogpost.
  4. We could store the length of the list in a separate int, so that we know how many 0s at the end of the list to take into consideration. If we don’t want to have a whole separate int, we can always write the length of the list as the exponent of 2 and start the actual list with the exponent of 3. This has some redundant information, though. The way to avoid redundant information is to store the number of final 0s in the list, instead of the entire length. We won’t be worrying about any of this, though.
  5. Note that using yield is no different from using return and taking as an argument a state variable (it’s often enough to have the last returned element). This is a bit like Continuation Passing Style. Also similar to the usual accumulator hack for making non-tail recursive functions tail-recursive. If you’ve never heard of the accumulator trick, here are some links [1], [2] I might one day write about imitating iterators in languages that don’t have them.
  6. See also the paper The Genuine Sieve of Erathosthenes, which clears up a common confusion about how the algorithm is defined.

FBSim: football-playing AI agents in Rust

I took a two week vacation in early November. Somehow I decided to spend it learning a bit more about Rust and Reinforcement Learning (RL), a sub-field of AI that I haven’t explored much before. We won’t be talking about RL this post, though. That’s for a future blogpost.

All of that lead to me writing FBSim1, which I struggled to describe as “a game-like football simulation environment for trying out AI ideas, written in Rust“. As a game-dev framework, I used amethyst. Rust, amethyst and FBSim are all open source.

The idea is the following: you write functions defining how each of your players are to behave according to their role (forward, left, right, defender and goalie)2 and according to the environment, which includes information about the position of every teammate, every opponent, the ball and the net. FBSim does the rest for you.

In Rust terms, all you need to do in order to make your own AI for the game and see it play against other AIs is to implement the Engine trait –a trait is like a protocol in swift, an interface in Go, or an Abstract Base Class in python: essentially, a set of methods–, then you register your engine and change the configuration to make the players use it.

We’ll be implementing a simple AI for FBSim from scratch later on in the blogpost (beginners to Rust are welcome too), but first let me show you how the game looks like.

These are two very simple stock AIs playing each other.

BasicWingWait detroying Basic on a game

Exciting? No? Well… uhh… it’s a bit more exciting if you write the AI yourself.

Let’s go ahead with the implementation.

Get (to) the sources

First, let’s download the source code, and checkout the tag for the code that’s compatible with this blogpost.

git clone
cd fbsim
git checkout blogpost-1b # APIs may change in the future!

Get Rust (if necessary)

If you have Rust and cargo installed, you’re ready to go ahead. If not, then you should check out rustup. Once you’ve installed rustup, and used it to install the compiler (I’m using 1.47.0) and cargo, you’re good to go.

Note: FBSim uses const fn with a match statement, so you will need a relatively new version of the Rust compiler. If you’re getting errors mentioning const fn and match, then you probably need to update your compiler to a newer version.

Build FBSim


In order for amethyst to work, you should first install some system dependencies, as detailed here. Install the dependencies according to your distribution and then you can proceed to run the following.

source ./ # necessary due to a rendy bug with vulkan
cargo run


If you’re under macOS, you first need to change the Cargo.toml to use metal instead of vulkan.

Your final Cargo.toml file should look like this:

name = "fbsim"
version = "0.1.0"
authors = ["Ian Tayler <>"]
edition = "2018"
readme = ""
repository = ""
license = "MIT"

# See more keys and their definitions at

version = "0.15.3"
features = ["metal"]

version = "1.0.104"

rand = "0.7.0"
rand_distr = "0.3.0"

Now, you can save the file and run:

cargo run

I didn’t test this on a mac, so these instructions are based on speculation. Let me know if it doesn’t work for you.

Note: It was brought to my attention that some people were having the following issue mentioning gfx-backend-metal and xcrun on macOS. If that’s you, follow the linked issue to get a likely solution.


I think the simple cargo run should work by itself under windows. Let me know if that’s not the case!

Checking that it works

If it works, you should see something like the following screenshot. The players should be moving about. Some of them could be standing still. That’s normal.

What an FBSim Engine is

What we’re going to do now is to implement a new Engine and try it out against one of the stock ones.

Implementing an Engine amounts to implementing a bunch of functions that take as input a representation of the state of the game, and output a description of how the agent should behave next frame.

In essence, we’ll be telling our players what to do based on what they can see.

Engine input

As input we’ll get an EngineData struct. Here is how it’s defined.

/// Input for engine functions.
pub struct EngineData<'a> {
    pub ball_position: Vector2<f32>,
    pub own_position: Vector2<f32>,
    pub own: &'a player::Player,
    pub own_type: &'a player::PlayerType,
    pub own_net_position: Vector2<f32>,
    pub opponent_net_position: Vector2<f32>,
    pub teammates_position: Vec<(PlayerType, Vector2<f32>)>,
    pub opponents_position: Vec<(PlayerType, Vector2<f32>)>,

If you’re new to Rust. You can ignore the 'a. It’s a lifetime parameter. The rest should look somewhat familiar if you’ve used languages like C, C++, Go, etc. All the pubs just mean we want the fields and the struct to be publicly exported. Let’s look at the fields we actually care about.

ball_position: Vector2<f32>

This will have the absolute position of the ball –(0, 0) is the lower left of the screen–. You can access each coordinate by using Vector2‘s x and y fields. Vector2s also have special methods defined and most operations (+, -, *) behave like you’d expect if you’ve worked with numpy and similar libraries.

The actual Vector2 type is defined in the nalgebra crate3. The <f32> part just means the values in each coordinate are 32bit floats.

own_position: Vector2<f32>,

Same as above, but this will have the currently-running player’s position.

own_net_position: Vector2<f32>,
opponent_net_position: Vector2<f32>,

You get the idea, right? Positions of the centers of both nets. opponent_net is where you are trying to score. These are also always absolute positions, so they will be the same value every time.

teammates_position: Vec<(PlayerType, Vector2<f32>)>,
opponents_position: Vec<(PlayerType, Vector2<f32>)>,

Here we have a vector (Vec, similar to C++’s vectors, python’s lists, etc.) having the positions of all your 4 teammates (in the case of teammates_position) or the positions of your 5 opponents (in the case of opponents_position); along with the respective role of each of those opponents or teammates –each element is a pair with role first and position later–. We will not be using these two fields for the rest of this post, but if you plan on implementing a good engine, you probably should.

Engine output

/// Return type of engine functions.
pub struct EngineTransition {
    /// Velocity vector (pixels per second). Magnitude will
    /// be cropped to player speed!
    pub velocity: Vector2<f32>,
    /// Which action to activate (if any).
    pub action: Option<player::ActionType>,

This is what our functions will return. We define our player’s velocity in the x and y axis, and we set the player’s action, if any.

The action field will be either Some(ActionType::Kick) if we want our player to kick the ball if it collides with them next frame and None if we want the player to gently push the ball forward.4 There are no other actions in this version of FBSim.

Engine trait (i.e. methods)

pub trait Engine {
    fn goalie(&mut self, engine_data: EngineData) -> EngineTransition;
    fn forward(&mut self, engine_data: EngineData) -> EngineTransition;
    fn left(&mut self, engine_data: EngineData) -> EngineTransition;
    fn defender(&mut self, engine_data: EngineData) -> EngineTransition;
    fn right(&mut self, engine_data: EngineData) -> EngineTransition;
    fn dispatch(&mut self, engine_data: EngineData) -> EngineTransition {
        match engine_data.own_type {
            PlayerType::Goalie => self.goalie(engine_data),
            PlayerType::Defender => self.defender(engine_data),
            PlayerType::Forward => self.forward(engine_data),
            PlayerType::Left => self.left(engine_data),
            PlayerType::Right => self.right(engine_data),

What this says is we need to implement methods called goalie, defender, left, right and forward. Each will govern how each of our roles play. The dispatch method is already implemented by default, and it’s unlikely you’ll want to change it.

The only difference between roles is the position they start on in the field after each goal and the fact that goalies have a bigger hitbox.

SimpleEngine trait

When convenient, you can use SimpleEngine, which only asks that you implement an engine_func method and implements all the Engine methods for you.

This can be useful when you will use exactly the same code for all your roles. You just write that code once in engine_func and SimpleEngine uses it for goalie, left, right, etc. indistinctly.

pub trait SimpleEngine {
    fn engine_func(&mut self, engine_data: EngineData) -> EngineTransition;

This is what we’ll be using. But you’ll likely want to have different methods for your different roles if you implement something more complex.


Minimal Engine implementation

We’ll create a file src/engines/ That’s where our entire engine will live.

We’ll first create the silliest engine possible: we’ll always stay put and not do any actions. Later we’ll make a less silly engine.

// src/engines/
// First import everything we need.
// "crate" are imports relative to our own code.
// So crate::components is under src/components, for example.
use crate::{        
    engines::{EngineData, EngineTransition, SimpleEngine},
// We don't need any fields. We just define an empty struct.
pub struct MyEngine;    

// We use `impl` to implement the `SimpleEngine` trait.
impl SimpleEngine for MyEngine {
    fn engine_func(&mut self, engine_data: EngineData) -> EngineTransition {
        // In Rust, the last expression in the function is the return value!
        // Here, we return an EngineTransition, as the trait defines.
        // Remember not to add a semicolon (;) for the return expression
        // as that evaluates to an empty value ()!
        EngineTransition {
            // Stay put! Moving is dangerous!!!! (´・_・`)
            velocity: math::Vector2::new(0.0, 0.0),
            // Don't act! Actions have consequences!!!!! (´・_・`)
            action: None,

Exporting our engine

Now we need to register our new engine in the EngineRegistry. For that, we’ll need to export our MyEngine struct and import it in the registry.

We export our new file using pub mod in the module’s file.

// src/engines/
pub use self::engine::{Engine, EngineData, EngineTransition, SimpleEngine};

pub mod basic;
pub mod engine;
pub mod myengine;

Adding our Engine to the registry

Now we add it to the registry in src/resources/ First step is importing it.

// src/resources/
use crate::engines::{          
    basic::{Basic, BasicWingWait},  
use std::collections::BTreeMap;

And then, later down the same file, we add an insert in the definition of EngineRegistry::default.

// src/resources/, line 41
impl Default for EngineRegistry {
    fn default() -> Self {
        let mut registry =
            EngineRegistry::new(BTreeMap::<String, Box<dyn Engine + Send + Sync>>::new());
        registry.insert("basic".to_string(), Box::new(Basic::new()));
        registry.insert("myengine".to_string(), Box::new(MyEngine {}));

Here we added an entry to a map from a String to a Box<dyn Engine>.5,6

Two things require explanation for people less familiar with Rust: Box and to_string. If you already know about them, skip to the next section.

We need to use a Box around our Engine due to the way Rust handles dynamic dispatch to structs. Box allocates our struct in the heap and allows us to handle structs of different sizes in the same data structure.

The reason why we need to call the to_string method on "myengine" is because string literals in Rust are not really of type String but rather of type &str. This is an optimization that avoids heap-allocation of literals, but the fact that they’re of a different type means we have to call to_string() on them if we need a heap-allocated String.

Changing configuration to use myengine

Currently, the best way to use a different engine is to change the configuration in assets/player.ron. This will change the engine used by the upper-side team. The lower-side team is governed by the configuration in assets/enemy.ron.

I’ll add a command-line parameter for this in the future.

For now, what you need to do is change the value in EngineRunner to "myengine".

// assets/player.ron line 123
extras: PlayerData(
    player: (
        speed: 48.0,
        kick_strength: 256.0,
        push_strength: 96.0,
        side: UpperSide,
    robot: Robot(logic_module: EngineRunner("myengine")),

Running our small engine

All that’s left to do is to run cargo run again. You may need to do source ./ again if you’re running linux and you closed your first terminal.

You should see the upper-side players stay put while the lower-side team does its best to score.

Improving the engine

Okay. That wasn’t much. Let’s go ahead and write an engine that at least tries to play the game. We’ll modify MyEngine directly so that we don’t need to worry about registering another engine.

Here’s the code:

// We use `impl` to implement the `SimpleEngine` trait.
impl SimpleEngine for MyEngine {
    fn engine_func(&mut self, engine_data: EngineData) -> EngineTransition {
        // The idea behind this engine is simple:
        // 1. If we're close to the ball, run towards it and kick it!
        // 2. If we're far away from the ball, run to your own net! Defense!
        // First compute what's the difference in position with the ball.
        let difference_with_ball = engine_data.ball_position - engine_data.own_position;
        // Vector2 from nalgebra implements `norm()` which computes the euclidean norm.
        if difference_with_ball.norm() > 100.0 {
            // We're far away from the ball. Find in which direction is your net!
            let difference_with_own_net = engine_data.own_net_position - engine_data.own_position;
            // Now run in that direction as fast as you can!
            EngineTransition {
                velocity: difference_with_own_net * engine_data.own.speed,
                action: None,
        } else {
            // We're close to the ball!
            // ATTAAAAACKKKK!!!!!
            EngineTransition {
                velocity: difference_with_ball * engine_data.own.speed,
                action: Some(ActionType::Kick),

I didn’t explain what engine_data.own was before. It just holds a bunch of constant data about the player: their speed, how hard they can kick the ball, etc.

For those of you unfamiliar with Rust, the reason why this method works and returns the right value from inside the if ... else ... is because if-else itself is an expression, and that expression evaluates either to the last expression of the if block in case the condition is true, or the last expression in the else block, if the condition evaluates to false. In other words, if-else behaves like the ternary operator A ? B : C does in other imperative languages.

If you save those changes and run cargo run you should see a bunch of players desperately running toward their own goal and kicking the ball away as hard as they can when it gets close to them.

The code for MyEngine, including the changes to the registry and the configuration can be found in the branch blogpost-1-implemented of the github repo for FBSim.

If you want to keep improving MyEngine, with a little bit of work you should be able to beat basic and basic_wing_wait without having to do anything fancy. You can look at their implementation under src/engines/ for inspiration. Remember to change assets/enemy.ron to use basic_wing_wait instead of basic if you want to play against it. Think of basic as level 1 and basic_wing_wait as level 2.

Future steps for FBSim

This was a fun project during my two-week vacation. Nothing serious. If I keep working on it, this is what I’d like to do.7

  • Adding support for writing engines in scripting languages. Was thinking of python and lua. Might try to throw in mun just for fun.
  • Training Reinforcement Learning agents for FBSim. Might do it in Rust directly or wait until I implement python scripting depending on the state of the Rust ecosystem for ML/RL.
  • Adding support for self-managed tournaments between a group of engines.
  • Maybe more game-like features to lower the entry barrier?

Big thanks to my friends Elías Masquil and Joaquín Alori who shared very helpful comments about earlier versions of this blogpost and FBSim.


  1. I do know “FBSim” is the worst name ever. Don’t judge me.
  2. Raise your hand if the mention of a goalie was the first time you realised I was talking about soccer. If you need to localize the game to North America, I think changing the sprites and the background colour to make it look like ice hockey shouldn’t be too difficult.
  3. When using amethyst, you should probably not import nalgebra directly as you can get most symbols by importing amethyst::core::math, which re-exports a lot of stuff from nalgebra.
  4. If you’ve never heard of an Option type, it’s a type that encodes a value that may not exist, in a similar way of how you’d use None in python, sentinel values in C, and other contraptions in other languages. The difference is you need to also mark the case where the value does exist as Some(value), and not just value. This way, the compiler can check whether you’re making the mistake of treating a value that may not exist as if it always existed. It’s a cool little idea that’s most common in functional programming languages, but has been popular with imperative languages as well recently.
  5. It’s actually Box<dyn Engine + Sync + Send> but let’s not go into unnecessary detail.
  6. You can ignore the dyn if you don’t know what that is. It’s just telling Rust that we don’t know at compile time the exact type of the objects in the boxes: we only know about a trait they implement.
  7. Not counting the small changes like adding a command-line argument for setting which engine to use, etc.

A 14th century proof of the divergence of the harmonic series

Nicole d’Oresme was a philosopher from 14th century France. He’s credited for finding the first proof of the divergence of the harmonic series. In other words, he authored the first proof we know of for the fact that 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + ... is infinite.

His proof is very simple, so much so that I think probably someone with more talent for educating than me could probably teach it to middle school students. I think that’s really cool.

Formally, the sum we’re talking about is the sum of 1/k for all positive integers k. What d’Oresme did was proving that for any number C, you can find a finite k so that the sum of the first k terms (i.e. 1 + \frac{1}{2} + \frac{1}{3} + ... + \frac{1}{k-1} + \frac{1}{k}) in the original sum is larger than C. If the first k terms sum to more than C, then the entire sum must be larger than C as well (because there are no negative numbers in the sum). That means the sum is larger than any finite number and, therefore, it has to be infinite.

But how do you find the right k that will make your sum large enough? Well, d’Oresme used the following fact: if you take all the 1/k numbers with k>=n and k<2n, then you have n numbers that are all larger than \frac{1}{2n}. The sum of all n of them must be larger than \frac{n}{2n} = \frac{1}{2}.

The above proves that the first number (i.e. with k>=1 and k<2*1) has to be larger than \frac{1}{2}. From 2 to 3 (i.e. k>=2,k<4) the sum should also be larger than \frac{1}{2}. The numbers from 4 to 7 also sum more than \frac{1}{2}. In other words, the sum of numbers from any power of 2 to the next power of 2 should sum over \frac{1}{2}. It’s easy to see from this that the first 2^{2C} numbers in the series should sum something that’s larger than C.

In other words, the k such that the sum of the first k terms is higher than C is k=2^{2C}. And that, in a nutshell, is Nicole d’Oresme’s proof of the divergence of the harmonic series.

Nicole d’Oresme studied a bunch of cool topics. I think he, as pretty much all other medieval scientists, is an under-appreciated historical figure. I like reading about medieval scientists and their works, so I might post other interesting things from them in the future.

Do bear in mind that I’m not a historian, so any historical commentary I include in my posts about medieval scientists should be taken with a grain of salt.

Add files if using pkg_resources

If you’re working on a python package, and you’re using setuptools‘s pkg_resources for handling data files in your package, then you should add files to all your internal sub-directories. In fact, just make your life easier and always add packages everywhere as if you were using python2, because not having them also creates issues with some linters and other tools. files used to be obligatory for sub-packages up until PEP-420 –yes, 420– which was implemented in python 3.3. Now python itself can handle code in any subdirectories as modules without needing an file everywhere.

That said, certain libraries and tools don’t handle modules without files correctly. In particular pkg_resources behaves weirdly when you try to access data in a module that doesn’t have an file.

Suppose you have the directory structure below. In this situation, if you run resource_exists("mod1.mod2", "some_data_file.txt") –where resource_exists is imported from pkg_resources— while standing on the root of the directory tree below, it will raise a NotImplementedError, which is not the most helpful of messages.

└── mod1
    └── mod2
        └── some_data_file.txt

This is if you’re not working on an installed package. If you actually install it, you may get a weirder expecting str, bytes or os.Pathlike, not None kind of error. I’m not sure when you get one or the other, but the point is: if there isn’t an file, there’s a good chance you won’t get what you’re looking for.

To fix this you just have to create an file; ending up with the following directory structure; and then resource_exists("mod1.mod2", "some_data_file.txt") just returns True.

└── mod1
    └── mod2
        └── some_data_file.txt

The case of pkg_resources.resource_filename is even more fun. Check this out.

When ran on a non-installed package and if you don’t have the inner, resource_filename("mod1.mod2", "some_data_file.txt")will return "some_data_file.txt". So, just your second argument. Not even "mod1/mod2/some_data_file.txt". Just the final filename. Why it does this, I’m not sure.

If you do have the then it will return "/absolute/path/to/your/project/mod1/mod2/some_data_file.txt". This is the normal behaviour of this function. You usually want to open the output of resource_filename, so the path must be complete (although it doesn’t need to be absolute).

Note that if you actually pass it a module that doesn’t exist (as opposed to a module that exists but doesn’t have an file), it will raise a ModuleNotFoundError.

This means something like the following function works even without an Of course, I don’t need to tell you not to use a function like this one as it is the dirtiest hack around, and probably has its own set of bugs and edge-cases, but I though it was funny that something like this worked when I tested it.

def resource_filename(module_path, resource_path):
    pkg_resources_path = pkg_resources.resource_filename(module_path, resource_path)
    if "/" not in pkg_resources_path:
        mod = sys.modules[module_path]
        if isinstance(mod.__path__, list):
            module_fs_path = mod.__path__[0]
            # Hoping this behaves like a _NamespacePath.
            module_fs_path = mod.__path__._path[0]
        os.mknod(os.path.join(module_fs_path, ""))
        pkg_resources_path = pkg_resources.resource_filename(module_path, resource_path)
    return pkg_resources_path

On a different note, as I was writing this I also noticed that the .__path__ attribute of a module returns a List[str] if the module has an and a _NamespacePath, –whose attribute _path is a List[str]— if the module doesn’t. That’s fine because you really shouldn’t expect consistent behaviour from __*__ variables, but I thought it was a weird quirk. In a more general way, if you don’t have an, your module is called a namespace and looks different from a “normal” module in several ways.

The lesson is PEP-420 did not save us from files. Just use them everywhere and save yourself future problems.

Average Precision is sensitive to class priors

Average Precision (AP) is an evaluation metric for ranking systems that’s often recommended for use with imbalanced binary classification problems, especially when the classification threshold (i.e. the minimum score to be considered a positive) is variable, or not yet known. When you use AP for classification you’re essentially trying to figure out whether a classifier is any good by looking at how the model’s scores rank positive and negative items in the test set.

Typical scenarios where AP might come in handy include:

  • You’re classifying but the threshold is going to vary depending on external variables, so you can’t use normal thresholded metrics.
  • The classifier model is to be used both as a classifier and a ranker in different applications.
  • You’re building a pure classifier, but you’re interested in understanding how your classifier ranks results by its scores, because it could help you better understand where it’s going wrong, etc.

There’s one important property of AP which in my experience is rarely mentioned but incredibly important when used for classification problems:

Average Precision depends heavily on positive rate (i.e. class priors)

To some people this might seem obvious. Precision is famously skewed by positive rate, so surely Average Precision must be too! Well, that’s not a great argument. Although it reaches the right conclusion.

When I ran into this for the first time, I actually managed to convince myself of the opposite: that AP was independent of the positive rate. That’s wrong. This is how I tricked myself: “AP is the area under the precision-recall curve. If positive rate is high, then it will be easy to have high precision, but it will be hard to catch all the positive items (i.e. have high recall). If positive rate is low, then precision will be hard to get, but recall will be easier. Probably the changes in precision and recall cancel each other out to some degree.”

It took me a couple of minutes to realize the mistake in the above reasoning.1 Sloppy half-reasoning like that is often useful but dangerous. It’s important to always double check by either consulting the literature or trying to find a more rigorous argument or proof.

In this case, it only took me a couple of minutes to realize where the problem was: recall does not vary with positive rate, unlike what I implied. Of course, if the positive rate is higher, there’s going to be more items that you won’t catch, but the ratio of caught items is expected to stay the same.

Recall is independent of positive rate

Let’s visualize this with an example. In the following image we’ll position blue and orange dots on a line. Each dot represents an item in the test set. The farther to the right the item is, the higher the predicted score for that item is. Blue dots are positive examples, while orange dots are negative examples. Let’s look at what happens with precision (P) and recall (R).

Shows an example computation of precision and recall with a small toy dataset.
Precision and recall at a threshold

Now we’ll look at the same exact example, but we’ll undersample from the positive class. Note that the model giving the scores is exactly the same, all that’s changing is that we’re getting rid of some of the positive examples in the test set.

Notice how when we undersample from the positive class, precision can only go down. There is no positive example we can eliminate that will make precision go up. Recall, on the other hand, may go up or down depending on which side of the threshold we eliminate items from, but the expected result is that it will stay the same. More on that later.

Shows an example computation of precision and recall with the same dataset as above, but the positive class has been randomly undersampled. This causes precision to go down while recall stays the same.
Precision and recall at same threshold with blue undersampled

That example illustrates how when we reduce the positive rate in a test set, precision is expected to go down and recall is expected to stay the same. Of course, a single example with a small dataset doesn’t prove anything, but hopefully it helps visualize why precision and recall behave differently when we change the positive rate.

Another way to visualize how positive rate won’t affect your recall is to imagine having a dataset and then undersampling the negative class. When you undersample the negative class positive rate will go up, but recall won’t change, as the formula for recall (TruePositives / AllPositives) simply ignores all negative examples. In other words, recall depends only on the distribution of scores for items of the positive class, while precision depends on both the distribution of scores of the positive and the negative class, and also on the class priors.2

How this applies to AP

What we’re seeing then is that for a particular threshold, lowering the positive rate is expected to lower precision and maintain recall.3 In consequence, all values of the precision-recall curve are expected to go down, and therefore, AP –being the area under the precision-recall curve— is expected to go down as well.4

When could this become a problem? There’s two big cases.

  1. Comparing the performance of models tested with different data.
  2. Comparing the performance of a model in different subsets of a test set.

Suppose the following case: your classifier is retrained periodically with new data, and you want to know whether the model got better or worse. It’s dangerous to compare APs for different test sets because if the positive rate changed from one to the other, then you may have a big swing in AP that’s not really related to how good the model is!

One option to mitigate this could be to test all models with the same dataset, or to create carefully crafted stratified test sets that keep these properties constant artificially.5 Another option is to use AUC-ROC or other metrics that don’t have this problem instead of AP. But obviously, they come with their own caveats.

The other case is when you want to check for which types of items your model performs well or bad. If you simply compute AP for different subsets of your test set, you need to be very careful that the positive rates in those different subsets stay the same! If not, AP might give you a completely wrong idea.

To make it easier to visualize: if the positive rate is 50%, a completely random model is expected to have an AP of 0.5. On the other hand, if the positive rate is 2%, AP of a random model will be 0.02. Getting an AP of 0.5 in that case might be incredibly hard.6

Next time you’re working with AP in a classification context, be weary of the effect positive rate has on it.


  1. Apart from the obvious fact that having two things pulling in opposite directions is not enough to say they cancel each other out.
  2. With threshold t and the scores of the positive class being sampled from a distribution X_p, the expected recall for a large enough test set will be p(X_p >= t). That’s invariant to class priors, and to the underlying distribution of positive class items in the test set.
  3. “Lowering the positive rate” is reducing the ratio of the test set that is of positive class. It’s assumed that the underlying class-conditional distributions stay the same. Crucially, though, if in real life you see your positive rate change significantly, this will likely be accompanied by a change in class conditional distributions as well! Analysing such cases is complex. Understanding that AP reflects positive rate is just the first step in avoiding pitfalls when you have varying positive rates in test sets.
  4. There’s an important interaction that I decided to leave out of this blogpost for simplicity: if by undersampling we cause the minimum score for positive class items to change (e.g. if by undersampling the positive class, we get rid of the positive items with the lowest score) then AP could go up even if precision at all thresholds goes down! That’s because if the minimum score for ground-truth-positive items goes up, low thresholds (which usually have bad precision) will stop being taken into consideration for computing AP. All of this is not too important because for big-enough, dense-enough, natural datasets this effect should be much smaller than the one introduced by the change in precisions at each threshold. Nevertheless, I found it interesting.
  5. Of course, doing this makes evaluation metrics harder to interpret, as they no longer mean simple things like “the probability of a classified-positive to be ground-truth-positive”. Disentangling the meaning of an evaluation metric in a stratified test set requires some work. Also, doing this might penalize you too much for poor performance in very rare items.
  6. Interpreting some value of some evaluation metric to be “good” or “bad”, “hard” or “easy” depends on a lot of different factors. Here, “hard” is just intended to mean “much better than random guessing”.

On ways in which DynamoDB won’t scale

Note: I wrote this post in some other blog on June 2018, then moved it over here.

A lot of the work I’ve done for the past two months has revolved around one huge DynamoDB table and a few smaller tables complementing the huge one. I thought I’d share some of what I learned in the process.

I’m not going to be talking about the basics of DynamoDB (eventual consistency, avoiding scans and hot partitions, keeping items under 1KB, etc.). For the most basic advice, you should check out the DynamoDB docs.

Spoiler alert: the main takeaway from this post will be the following.

“DynamoDB will not scale unless you know what you’re doing from the very beginning.”

A lot of people have mentioned this already 1 2 3. Hot partitions
are probably the biggest problem here, but I’m not going to talk about that

Maximum query return size

Queries return up to 1MB of data (and this is before any filtering is done).

Now, imagine you’re building some kind of social network where people post photos. You’ll save the following data.

  • Entity: Photo
    • UserId
    • Timestamp – moment when the photo was added to the database.
    • Thumbnail
    • PhotoLink – something to help you retrieve the actual photo stored in S3.
    • Likes – number of likes/shares/upvotes/whatever
    • Title
    • Description

And your most important queries will be to get the last few photos by a
specific user.

Now, if you don’t know what you’re doing and decide to believe in the allegedly seamless scalability and ease of use of DynamoDB, you may consider having a single DynamoDB table with UserId as the partition key and Timestamp as the sort key, and having everything else as attributes. That way you can make a Query with Limit=K and ScanIndexForward=false to get the last K photos of a user.

Of course, DynamoDB will overwrite any items with the same UserId and Timestamp, which is something you probably don’t want to happen. But you can fix that with a fixed-size random postfix to the Timestamp or something of the sort. Let’s ignore that in the name of simplicity.

Now suppose you also want to add a feature where users can see their most-liked photos. This is where the 1MB limit in the query return size kicks in. You could implement it by querying for all photos by a user and sorting them by Likes in-memory. But, if the number of photos by a user grows so that you have more than a few MB of data for a single user, you’ll have to get that data in several sequential requests with 1MB responses each, instead of a single query, and that will degrade the user experience significantly.

In the case I’ve been working on, there was no noticeable difference between getting 0.1MB of data (i.e. setting a pretty low Limit) or 0.9MB of data in a query, but when we went over 1MB and had to do 2 sequential requests, the time pretty much doubled.

Global Secondary Indexes are incredibly expensive

To do that query efficiently you’ll have to add a Global Secondary Index with UserId as the partition key and Likes as the sort key which essentially doubles the cost of your table.

So why not adding a local secondary index instead? First of all, that won’t save you a lot of money if your expenses come mostly from write capacity. But there’s something worse.

ItemCollections are capped at 10GB

If you’re not very aware of this restriction, you could end up in a world of

When you have a local secondary index, you can’t have a collection of items with the same value in the partition key exceeding 10GB in total. In other words, applying the concept to this case, you can’t have more than 10GB of data for a single user (i.e. a collection of items with the same value in UserId).

In this case, it’s fairly unlikely normal users will ever reach that limit, but a bot posting every 30 seconds could get there quite fast (depending on the item sizes). And in other applications the limit could be very much within reach.


You could have thought that the entity we were trying to model (Photo) required simple enough queries that it had to be trivial to model with DynamoDB, but you would’ve thought wrong. If you need to make queries that are at all more complex than just retrieving a single entry given a key, you should not take the decision of using DynamoDB lightly.