The threshold is usually set to 0.5, but this need not be the best choice. If the model can be assumed to be well-calibrated^{1} –which usually isn’t the case, though– and the costs of a *false positive* and a *false negative* are known constants (call them c_FP and c_FN), then a theoretical cost-minimizing threshold can be computed with the formula t* = c_FP / (c_FP + c_FN). That means 0.5 is the optimal threshold under these conditions *only* when c_FP = c_FN. For imbalanced classification, usually that equality does not hold. And even in balanced settings, it’s more likely that it won’t.
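As a quick sanity check of that formula, with made-up costs (the values of c_FP and c_FN here are hypothetical):

```python
# Hypothetical costs: a false positive costs 1 unit,
# a false negative costs 4 units.
c_fp, c_fn = 1.0, 4.0

# Cost-minimizing threshold for a well-calibrated model:
# predict "target" whenever the score is >= c_fp / (c_fp + c_fn).
threshold = c_fp / (c_fp + c_fn)
print(threshold)  # 0.2
```

With equal costs the formula gives back the usual 0.5; making false negatives more expensive pushes the threshold down, so we flag positives more eagerly.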

Now, that formula won’t save us every time. We need something else. Sometimes our model is not well-calibrated, *as even when we use calibration methods on it, we may not trust that the result is actually properly calibrated*. In other cases the costs of *false positives* and *false negatives* may not be known or may not be constant. What’s the cost of saying a dog is a cat if you’re photo-categorizing software? Or the cost of wrongly tagging a sports news article as politics? It’s not always easy to quantify those costs.

So, suppose we wanted to use some other strategy to select the classification threshold. We could instead try to find the threshold that maximizes the F1 score on our validation set, or some other evaluation metric. How would we go about doing that?

The rest of this blogpost deals with finding thresholds that maximize (or minimize) classification metrics in O(n log n) time, where n is the size of the validation set used for optimizing the threshold.

The algorithm itself is very simple, but analyzing it seriously can help paint a clearer picture of how changing the threshold affects our metrics, and of what’s going on when we choose to use thresholded metrics for our classification models.

The general idea behind optimizing these thresholds fast is used in a bunch of applications in Data Science. Something very similar to what I’m describing is used to find splits when building CART trees by `scikit-learn`^{2} (the similarity to this problem is pretty obvious if you remember how CART works) and is also used when building the precision-recall curve and related metrics like average precision, also by `scikit-learn`.

Let’s start by visualizing examples of the functions we’re trying to optimize.

Remember we’re trying to find **the best threshold** so that the resulting target/no-target classification *using that threshold* **maximizes** some evaluation metric like accuracy, F1, etc.^{3}

Here the blue line is the function itself. The *x* axis represents all possible thresholds, and the *y* axis has the values of the evaluation metric being considered.

We also added green circles and red crosses. They represent our output scores for items in our dataset. In these examples the dataset –one would usually use the validation set– has 5 elements of *target* class and 5 of *non-target* class.^{4} These dots and crosses are not really part of the function itself, but visualizing the dataset together with the function helps (me) understand what’s going on.

Summarizing: the blue line is the evaluation function as we change the threshold. Circles and crosses mark scores for elements in our dataset of *target* and *non-target* class respectively.

Note how the value of our metric only changes when we move the threshold “over” some score in the dataset, changing the TP or FP count (more on that later). This will be true of all metrics that can be defined as a function of TP, FP, TN and FN.^{5}

Looking at the shapes of the functions can be pretty discouraging. They don’t belong to any of the well-known function types that are easy to optimize. They’re clearly not linear or affine, they’re not convex, in fact, not even continuous, and in the entire domain the derivative is either undefined or 0, so all gradient-based methods are out.

But, they have a very lovely property which is that **we only need to evaluate a finite number of points to know all values of the function**. This can be seen in the example functions, and isn’t hard to prove more rigorously.
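We can check this concretely with a tiny made-up example (scores and labels are mine, chosen only for illustration): accuracy is identical for any threshold strictly between two consecutive scores.

```python
import numpy as np

# Made-up ground truth and predicted scores.
y_val = np.array([0, 1, 1, 0, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.7])

# Any threshold strictly between the scores 0.4 and 0.7
# yields the same predictions, hence the same accuracy.
accs = []
for t in [0.45, 0.55, 0.65]:
    preds = y_proba >= t
    accs.append((preds == y_val.astype(bool)).mean())
print(accs)  # the three values are identical
```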

That property of only having to evaluate a finite number of thresholds takes us all the way from *not even knowing how to find the best threshold* to having an *obvious quadratic time algorithm*: just try every possible value and pick the best one! Shown here as python-ish pseudo-code (remember vectorization will come later).

```
def find_threshold_quad(y_proba, y_val, eval_func):
    """Find a threshold that maximizes eval_func."""
    evals = np.empty_like(y_proba)
    for i, threshold in enumerate(y_proba):
        # confusion_matrix runs in linear time
        # w.r.t. dataset size.
        tp, fp, tn, fn = confusion_matrix(
            y_val, y_proba >= threshold
        )
        # assume eval_func runs in constant time!
        evals[i] = eval_func(tp, fp, tn, fn)
    best_threshold_idx = np.argmax(evals)
    return y_proba[best_threshold_idx]
```

Something to note here: we’re saying that `eval_func` should run in constant time, but `scikit-learn`-style thresholded evaluation functions (like `sklearn.metrics.precision_score`) usually take linear time. How can that be?

That’s because `scikit-learn`-style functions count the TPs, FPs, etc. *as part of the evaluation function itself*. For the purpose of this blogpost we are using evaluation functions that *take* TP, FP, TN and FN as pre-computed values. We will assume the `eval_func`s are constant-time functions for the rest of this blogpost. Accuracy, precision, recall, F1, g-mean, and most other thresholded metrics have obvious constant-time implementations as functions of TP, FP, TN and FN. For example, accuracy is just (TP + TN) / (TP + TN + FP + FN), so it always involves 3 sums and a division.
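As a sketch, here's what such constant-time `eval_func`s can look like (these helper definitions are mine, not from any library):

```python
# Thresholded metrics as constant-time functions of the
# pre-computed confusion-matrix counts.
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp, fp, tn, fn):
    # Harmonic mean of precision and recall, written
    # directly in terms of the counts.
    return 2 * tp / (2 * tp + fp + fn)

print(accuracy(tp=3, fp=1, tn=4, fn=2))  # 0.7
```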

We can improve on the above algorithm’s time complexity, though. The trick is the following: if our list of scores is ordered, then there is no need to fully recompute the TPs, FNs, etc. for every single threshold, they don’t really change that much when we move from smallest to largest (or largest to smallest).

Assume all scores are different from each other to simplify the argument and the pseudo-code. We’ll deal with the problem of repeated scores later when we vectorize our code.

For the smallest threshold, i.e. when we are using the smallest predicted score as the threshold, everything is classified as *target* class, as every score (including itself) is greater than or equal to the smallest one. That means TN and FN are 0, while TP and FP are the count of elements of ground-truth *target* class and *non-target* class respectively.

While we move to the right, for each actual score we pass, we only need to update those numbers (TP, FP, etc.) by incrementing or decrementing them based on the class of the item we just passed. Each time we “pass over” a score, if it was of class *target*, we lose a TP and gain an FN. If it was of class *non-target*, we lose an FP and gain a TN.

This realization that *we don’t need to recompute TP, FP, etc. from scratch every time* lets us get rid of our **linear-time** `confusion_matrix` function, and replace it with **constant-time** modification of our `tp`, `fp`, `tn` and `fn` variables.

We don’t quite manage to drop time complexity to linear time over the entire optimization because we need to sort the scores to do this. After we sort them, the actual optimization itself is just linear time, though.

```
def find_threshold_loglinear(y_proba, y_val, eval_func):
    """Find a threshold that maximizes eval_func."""
    sort_idx = np.argsort(y_proba)
    y_proba, y_val = y_proba[sort_idx], y_val[sort_idx]
    evals = np.empty_like(y_proba)
    tp = y_val.sum()
    fp = (y_val == 0).sum()
    tn, fn = 0, 0
    for i, threshold in enumerate(y_proba):
        # assume eval_func runs in constant time.
        evals[i] = eval_func(tp, fp, tn, fn)
        # Note we modify _after_ evaluation.
        if y_val[i] == 1:
            tp -= 1
            fn += 1
        else:
            fp -= 1
            tn += 1
    best_threshold_idx = np.argmax(evals)
    return y_proba[best_threshold_idx]
```

That’s pretty much all there is to it. In the following section we’ll present a vectorized numpy implementation of this same algorithm, which also covers some edge-cases we ignored here.

In particular, think what happens when the best threshold is one that classifies *everything* as *non-target*. Is that case handled in the above pseudo-code? Also think about the case where a score is repeated in the dataset. Are we handling those cases correctly? How could that be fixed?

```
def find_threshold(y_val, y_proba, eval_func_vec):
    """Vectorized threshold optimization.

    This uses `np.cumsum` to vectorize the conditional
    for-loops we used in the naïve version.
    """
    # We start by sorting the probas in descending
    # order. (That's what the `-` is for)
    sort_idx = np.argsort(-y_proba)
    y_proba, y_val = y_proba[sort_idx], y_val[sort_idx]
    # We need to add an np.inf as a possible proba to
    # handle the case where the best threshold is one
    # where all predictions are of _non-target_.
    y_proba_completed = np.concatenate(
        [[np.inf], y_proba]
    )
    unique_idx = np.concatenate(
        [
            np.where(np.diff(y_proba_completed))[0],
            # Due to the way np.diff works, we need
            # to add the final element by hand.
            [y_val.size]
        ]
    )
    # These 0s here are added for getting the right
    # value with `cumsum` for `np.inf`.
    y_val_completed = np.concatenate([[0], y_val])
    y_val_completed_reverse = np.concatenate(
        [[0], y_val == 0]
    )
    tps = np.cumsum(y_val_completed)
    fns = np.sum(y_val_completed) - tps
    fps = np.cumsum(y_val_completed_reverse)
    tns = np.sum(y_val_completed_reverse) - fps
    # This eval_func_vec should compute the metrics
    # for several TP, FP, TN, FN configurations at once.
    values = eval_func_vec(
        tps[unique_idx], fps[unique_idx],
        tns[unique_idx], fns[unique_idx],
    )
    # nanargmax for dealing with undefined metrics,
    # like precision above highest score.
    best_unique_idx = np.nanargmax(values)
    best_idx = unique_idx[best_unique_idx]
    return y_proba_completed[best_idx]
```

The last question to answer is: how do you implement an `eval_func_vec`? Due to the overloaded syntax used in numpy, it’s usually very simple. A couple of examples follow.

```
def accuracy_vec(tps, fps, tns, fns):
    return (tps + tns) / (tps + tns + fps + fns)
```

```
def geometric_mean_vec(tps, fps, tns, fns):
    sensitivities = tps / (tps + fns)
    specificities = tns / (tns + fps)
    return np.sqrt(sensitivities * specificities)
```
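As a sanity check, here's a small self-contained sketch (with made-up data; it restates `accuracy_vec` and the descending-sort counting so it runs on its own) verifying that the `cumsum` sweep finds the same best accuracy as brute force:

```python
import numpy as np

# Made-up ground truth and predicted scores.
y_val = np.array([0, 1, 1, 0, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.7])

def accuracy_vec(tps, fps, tns, fns):
    return (tps + tns) / (tps + tns + fps + fns)

# Cumsum sweep over descending scores; the leading 0s stand
# for the "classify everything as non-target" threshold.
order = np.argsort(-y_proba)
yv = y_val[order]
tps = np.concatenate([[0], np.cumsum(yv)])
fps = np.concatenate([[0], np.cumsum(yv == 0)])
fns = yv.sum() - tps
tns = (yv == 0).sum() - fps
best_sweep = accuracy_vec(tps, fps, tns, fns).max()

# Brute force: try every score as a threshold, plus one
# threshold above all scores.
candidates = list(y_proba) + [np.inf]
best_brute = max(
    ((y_proba >= t) == y_val.astype(bool)).mean() for t in candidates
)
assert np.isclose(best_sweep, best_brute)
```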

The vectorized `find_threshold` above could be used for finding splits in algorithms for building decision trees, or, with small modifications, it could be used for creating precision-recall curves and other metric-to-metric curves fast.

Similar `cumsum`-type optimizations can be applied to time series data when you’re counting items before some time, computing accumulated averages, etc. The idea behind this algorithm (and its vectorization) is useful in a variety of situations in Data Science, and is worth keeping in mind.

- Well-calibrated intuitively means that if we take the set of items whose predicted score is (approximately) some value p, then the ratio of items of actual *target* class in that set should be close to p. In other words, when a model is well-calibrated, the score can be interpreted as the probability that the item is of class *target*.
- Here’s a link to an example place where sklearn uses the idea for trees and here’s how they use it for computing precision-recall curves and metrics that depend on counting TPs and FPs at all thresholds. Both those pieces of code have a lot of surrounding type-conversion, edge-case-matching, optional-parameter-considering code, so looking at the implementation in this blogpost first may help separate the important parts of that from the noise.
- When I say “the best threshold” I actually mean “one of the thresholds that are the best”. In other words, a threshold such that no other threshold is strictly better.
- In case the example is misleading, I want to clarify that the datasets needn’t be balanced. It *happens* to be balanced in these examples, but I could have chosen an imbalanced dataset as well.
- Alternatively, in terms of TP, FP, positive count, and negative count. Or other things from which we can compute the confusion matrix.

In `python`, `int`s, unlike in `C`, `Rust` or `Go`, are of arbitrary size. For example, you can open `python3` and run the following.

```
>>> import math
>>> math.factorial(2020)
[number omitted]
>>> math.log2(math.factorial(2020))
19272.453841606068
>>> type(math.factorial(2020))
<class 'int'>
```

So a normal, every-day `int` in `python` can easily store a value that would take up `19273` bits in a C-style fixed-size unsigned int type. In a language like `python` where convenience is valued over speed and memory efficiency, that’s really useful.

This *unlimited precision* also means we can store an *arbitrary amount of information* in a single `int`. With the right coding, an entire book, an entire database, or anything else fits in a *single* `python` `int`.

We can therefore imagine a dialect of `python` where we *only* have ints, and we need to represent everything else (`dict`s, `list`s, etc.) just by using `int`s. We’d also have special functions and methods that would *treat* `int`s as if they were `list`s, `dict`s, etc. This can be a fun exercise, and that’s what we’ll be doing in this post.

There’s an obvious way to do this: all data structures are just bit-arrays in memory. Worst-case scenario, it’s a set of related bit-arrays (like each node in a linked list or tree, for example), and sets are just bit arrays as well. Bit arrays can be interpreted as binary numbers. So we always have this option. But that’s a bit boring.
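In `python` that boring option is essentially built in: any sequence of bytes can be read back and forth as a single `int`.

```python
# A string's bytes, read as one big int and recovered.
data = "an entire book could go here".encode()
as_int = int.from_bytes(data, "big")
recovered = as_int.to_bytes(len(data), "big").decode()
assert recovered == "an entire book could go here"
```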

For this post, and for the following posts in this series, I’ll present some ways of representing complex data structures as `int`s that I think are *interesting*. They are not necessarily the most compact, the most reasonable, or the most efficient. The common objective is to find *interesting* or *fun* representations of these data structures.^{3}

The first data structure we’ll work on representing is `list`. What we’ll use is called Gödel numbering, named after the logician Kurt Gödel. We’ll be dealing with lists of *unsigned integers* (i.e. naturals) only, for convenience.

Gödel numbering works by abusing the fact that each natural number greater than 1 is *uniquely represented* by its prime factorization. This is given by the *fundamental theorem of arithmetic*.

Looking at some examples:

- 12 = 2**2 * 3**1.
- 35 = 5**1 * 7**1.
- 1000 = 2**3 * 5**3.

A number is uniquely identifiable by the *list of the exponents of its prime factors* (up to its highest non-zero exponent). So we can say 126 = 2**1 * 3**2 * 5**0 * 7**1 represents the list `[1, 2, 0, 1]`. The first place in the list is the exponent of 2 in the prime decomposition of the number, the second value is the exponent of 3, and so on.

A couple of examples: 18 = 2**1 * 3**2 represents `[1, 2]`, and 2250 = 2**1 * 3**2 * 5**3 represents `[1, 2, 3]`.
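The encoding direction can be sketched in a few lines of plain `python` (this `encode` helper is mine, for illustration; it is not part of the int-only dialect we'll build below):

```python
def encode(lst):
    """Gödel-encode a short list of naturals (sketch)."""
    small_primes = [2, 3, 5, 7, 11, 13]  # enough for short examples
    n = 1
    for p, e in zip(small_primes, lst):
        n *= p ** e
    return n

print(encode([1, 2, 0, 1]))  # 2**1 * 3**2 * 5**0 * 7**1 = 126
```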

What happens if you have a `0` at the end of the list? Well, with this encoding you can’t. There’s an infinite stream of primes with exponent `0` in our factorization, so we need to stop somewhere.^{4} We chose to stop at the last non-zero entry.

This representation uses *very* large numbers when the list contains somewhat large numbers. That’s because the numbers in our list become exponents, so the size of the int grows exponentially with them. For example, `[50, 1000, 250]` uses a number with size `2216` bits. On the other hand, relatively long lists of *very* small `int`s, and especially large sparse lists (i.e. where most of the values are 0), have very compact representations compared to other `int`-encodings of `list`s.

Keep in mind this is not a practical encoding of `list`s as `int`s for programming. It’s meant to be a fun experiment only.

Let’s look at an implementation in `python`. A couple of notes here:

- We’ll allow ourselves to use functions with `yield` because it simplifies things a great deal.^{5}
- You’ll notice an excessive use of `while` loops. That’s because list comprehensions, `range`, and most things you’d consider using in a `for` loop are banned from our int-only dialect. All of those end up being replaced with `while` loops here.

The first function we’ll write is an iterator that yields primes in order. This will be useful throughout. This implementation is the *simplest* implementation that works. I might write an entire post in the near future about algorithms for generating primes, because it’s such a cool topic, and a venerable field of research in itself. The algorithm most people know is the Sieve of Eratosthenes, but that’s just the tip of the iceberg.^{6}

For today, a very naïve implementation will do.

```
def primes(starting: int = 2):
    """Yield the primes in order.

    Args:
        starting: sets the minimum number to consider.

    Note: `starting` can be used to get all prime numbers
    _larger_ than some number. By default it doesn't skip
    any candidate primes.
    """
    candidate_prime = starting
    while True:
        candidate_factor = 2
        is_prime = True
        # We'll try all the numbers between 2 and
        # candidate_prime / 2. If any of them divide
        # our candidate_prime, then it's not a prime!
        while candidate_factor <= candidate_prime // 2:
            if candidate_prime % candidate_factor == 0:
                is_prime = False
                break
            candidate_factor += 1
        if is_prime:
            yield candidate_prime
        candidate_prime += 1
```

```
def empty_list() -> int:
    """Create a new empty list."""
    # 1 is the empty list. It isn't divisible by any prime.
    return 1
```

```
def iter_list(l: int):
    """Yields elements in the list, from first to last."""
    # We go through each prime in order. The next value of
    # the list is equal to the number of times the list is
    # divisible by the prime.
    for p in primes():
        # We decided we will have no trailing 0s, so when
        # the list is 1, it's over.
        if l <= 1:
            break
        # Count the number of divisions until the list is
        # not divisible by the prime number.
        num_divisions = 0
        while l % p == 0:
            num_divisions += 1
            l = l // p  # could be / as well
        yield num_divisions
```

```
def access(l: int, i: int) -> int:
    """Return i-th element of l."""
    # First we iterate over all primes until we get to the
    # ith prime.
    j = 0
    for p in primes():
        if j == i:
            ith_prime = p
            break
        j += 1
    # Now we divide the list by the ith prime until we
    # can't divide it any more.
    num_divisions = 0
    while l % ith_prime == 0:
        num_divisions += 1
        l = l // ith_prime
    return num_divisions
```

```
def append(l: int, elem: int) -> int:
    # The first step is finding the largest prime factor.
    # We look at all primes until l.
    # The next prime after the last prime factor is going
    # to be the base we need to use to append.
    # E.g. if the list is 18 -> 2**1 * 3**2 -> [1, 2]
    # then the largest prime factor is 3, and we will
    # multiply by the _next_ prime to some power to
    # append to the list.
    last_prime_factor = 1  # Just a placeholder
    for p in primes():
        if p > l:
            break
        if l % p == 0:
            last_prime_factor = p
    # Now get the _next_ prime after the last in the list.
    for p in primes(starting=last_prime_factor + 1):
        next_prime = p
        break
    # Now finally we append an item by multiplying the list
    # by the next prime to the `elem` power.
    return l * next_prime ** elem
```

You can open a `python`, `ipython` or `bpython` session and try these functions out!

You should use numbers between `1` and `10` for the list elements. If you use larger numbers, `append` and `access` can take a *very* long time. In part, this is an impractical fact about using Gödel numbering for lists, although the range of viable element values can be improved a great deal by optimizing the prime generation and factorization algorithms.

```
In [16]: l = empty_list()
In [17]: l = append(l, 2)
In [18]: l = append(l, 5)
In [19]: list(iter_list(l))
Out[19]: [2, 5]
In [20]: access(l, 0)
Out[20]: 2
In [21]: access(l, 1)
Out[21]: 5
In [22]: l
Out[22]: 972
```

We saw one way in which we can represent lists of natural numbers as `int`s. There are other, more practical ways that rely on subdividing the binary representation of the number into variably-sized chunks. I’m sure you can come up with how that would look.

In the future I might write other posts about better algorithms for generating primes and factorizing numbers, and also about int-representations of other complex data structures.

- I would assume the implementation breaks up at some point even before you run out of memory, but the documentation does explicitly mention they have *unlimited precision*.
- Note this is true of `python3` but not of `python2`. In the case of `python2`, `int`s are fixed-size. I think it’s safe to just say `python` and mean `python3` in 2020, but I also think this detail is worth a footnote.
- For the case of Gödel numbering for lists, it’s actually easy to argue that it’s a particularly *bad* representation. We’ll talk a bit about the trade-offs in the representation later in the blogpost.
- We could store the length of the list in a separate `int`, so that we know how many `0`s at the end of the list to take into consideration. If we don’t want to have a *whole separate* `int`, we can always write the length of the list as the exponent of 2 and start the *actual* list with the exponent of 3. This has some redundant information, though. The way to avoid redundant information is to store *the number of final* `0`s in the list, instead of the entire length. We won’t be worrying about *any* of this, though.
- Note that using `yield` is no different from using `return` and taking as an argument a `state` variable (it’s often enough to have the last returned element). This is a bit like Continuation Passing Style. Also similar to the usual accumulator hack for making non-tail recursive functions tail-recursive. If you’ve never heard of the accumulator trick, here are some links [1], [2]. I might one day write about imitating iterators in languages that don’t have them.
- See also the paper The Genuine Sieve of Eratosthenes, which clears up a common confusion about how the algorithm is defined.

All of that led to me writing FBSim^{1}, which I struggled to describe as *“a game-like football simulation environment for trying out AI ideas, written in Rust”*. As a game-dev framework, I used amethyst. Rust, amethyst and FBSim are all open source.

The idea is the following: you write functions defining how each of your players is to behave according to their role (forward, left, right, defender and goalie)^{2} and according to the environment, which includes information about the position of every teammate, every opponent, the ball and the net. FBSim does the rest for you.

In Rust terms, all you need to do in order to make your own AI for the game and see it play against other AIs is to implement the `Engine` *trait* –a trait is like a `protocol` in swift, an `interface` in Go, or an `Abstract Base Class` in python: essentially, a set of methods–; then you register your engine and change the configuration to make the players use it.

We’ll be implementing a simple AI for FBSim from scratch later on in the blogpost (beginners to Rust are welcome too), but first let me show you what the game looks like.

These are two very simple stock AIs playing each other.

Exciting? *No?* Well… uhh… it’s a bit more exciting if you write the AI yourself.

Let’s go ahead with the implementation.

First, let’s download the source code, and checkout the tag for the code that’s compatible with this blogpost.

```
git clone git@github.com:IanTayler/fbsim.git
cd fbsim
git checkout blogpost-1b # APIs may change in the future!
```

If you have Rust and `cargo` installed, you’re ready to go ahead. If not, then you should check out `rustup`. Once you’ve installed `rustup`, and used it to install the compiler (I’m using 1.47.0) and `cargo`, you’re good to go.

**Note:** FBSim uses `const fn` with a `match` statement, so you will need a relatively new version of the Rust compiler. If you’re getting errors mentioning `const fn` and `match`, then you probably need to update your compiler to a newer version.

In order for amethyst to work, you should first install some system dependencies, as detailed here. Install the dependencies according to your distribution and then you can proceed to run the following.

```
source ./env.sh # necessary due to a rendy bug with vulkan
cargo run
```

If you’re under macOS, you first need to change the `Cargo.toml` to use `metal` instead of `vulkan`.

Your final Cargo.toml file should look like this:

```
[package]
name = "fbsim"
version = "0.1.0"
authors = ["Ian Tayler <iangtayler@gmail.com>"]
edition = "2018"
readme = "README.md"
repository = "https://github.com/IanTayler/fbsim/"
license = "MIT"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies.amethyst]
version = "0.15.3"
features = ["metal"]
[dependencies.serde]
version = "1.0.104"
[dependencies]
rand = "0.7.0"
rand_distr = "0.3.0"
```

Now, you can save the file and run:

```
cargo run
```

I didn’t test this on a mac, so these instructions are based on speculation. Let me know if it doesn’t work for you.

**Note:** It was brought to my attention that some people were having the following issue mentioning `gfx-backend-metal` and `xcrun` on macOS. If that’s you, follow the linked issue to get a likely solution.

I think the simple `cargo run` should work by itself under windows. Let me know if that’s not the case!

If it works, you should see something like the following screenshot. The players should be moving about. Some of them could be standing still. That’s normal.

What we’re going to do now is implement a new `Engine` and try it out against one of the stock ones.

Implementing an `Engine` amounts to implementing a bunch of functions that take as input a representation of the state of the game, and output a description of how the agent should behave next frame.

In essence, we’ll be telling our players what to do based on what they can see.

As input we’ll get an `EngineData` struct. Here is how it’s defined.

```
/// Input for engine functions.
pub struct EngineData<'a> {
    pub ball_position: Vector2<f32>,
    pub own_position: Vector2<f32>,
    pub own: &'a player::Player,
    pub own_type: &'a player::PlayerType,
    pub own_net_position: Vector2<f32>,
    pub opponent_net_position: Vector2<f32>,
    pub teammates_position: Vec<(PlayerType, Vector2<f32>)>,
    pub opponents_position: Vec<(PlayerType, Vector2<f32>)>,
}
```

If you’re new to Rust, you can ignore the `'a`. It’s a lifetime parameter. The rest should look somewhat familiar if you’ve used languages like C, C++, Go, etc. All the `pub`s just mean we want the fields and the struct to be publicly exported. Let’s look at the fields we actually care about.

`ball_position: Vector2<f32>`

This will have the *absolute* position of the ball –(0, 0) is the lower left of the screen–. You can access each coordinate by using `Vector2`’s `x` and `y` fields. `Vector2`s also have special methods defined, and most operations (`+`, `-`, `*`) behave like you’d expect if you’ve worked with `numpy` and similar libraries.

The actual `Vector2` type is defined in the `nalgebra` crate^{3}. The `<f32>` part just means the values in each coordinate are 32-bit floats.

```
own_position: Vector2<f32>,
```

Same as above, but this will have the currently-running player’s position.

```
own_net_position: Vector2<f32>,
opponent_net_position: Vector2<f32>,
```

You get the idea, right? Positions of the centers of both nets. `opponent_net` is where *you* are trying to score. These are also *always* absolute positions, so they will be the same value every time.

```
teammates_position: Vec<(PlayerType, Vector2<f32>)>,
opponents_position: Vec<(PlayerType, Vector2<f32>)>,
```

Here we have a vector (`Vec`, similar to C++’s vectors, python’s lists, etc.) having the positions of all your 4 teammates (in the case of `teammates_position`) or the positions of your 5 opponents (in the case of `opponents_position`); along with the respective role of each of those opponents or teammates –each element is a pair with role first and position later–. We will not be using these two fields for the rest of this post, but if you plan on implementing a good engine, you probably should.

```
/// Return type of engine functions.
pub struct EngineTransition {
    /// Velocity vector (pixels per second). Magnitude will
    /// be cropped to player speed!
    pub velocity: Vector2<f32>,
    /// Which action to activate (if any).
    pub action: Option<player::ActionType>,
}
```

This is what our functions will return. We define our player’s velocity in the x and y axis, and we set the player’s action, if any.

The `action` field will be either `Some(ActionType::Kick)`, if we want our player to kick the ball when it collides with them next frame, or `None`, if we want the player to gently push the ball forward.^{4} There are no other actions in this version of FBSim.

```
pub trait Engine {
    fn goalie(&mut self, engine_data: EngineData) -> EngineTransition;
    fn forward(&mut self, engine_data: EngineData) -> EngineTransition;
    fn left(&mut self, engine_data: EngineData) -> EngineTransition;
    fn defender(&mut self, engine_data: EngineData) -> EngineTransition;
    fn right(&mut self, engine_data: EngineData) -> EngineTransition;

    fn dispatch(&mut self, engine_data: EngineData) -> EngineTransition {
        match engine_data.own_type {
            PlayerType::Goalie => self.goalie(engine_data),
            PlayerType::Defender => self.defender(engine_data),
            PlayerType::Forward => self.forward(engine_data),
            PlayerType::Left => self.left(engine_data),
            PlayerType::Right => self.right(engine_data),
        }
    }
}
```

What this says is we need to implement methods called `goalie`, `defender`, `left`, `right` and `forward`. Each will govern how one of our roles plays. The `dispatch` method is already implemented by default, and it’s unlikely you’ll want to change it.

The only difference between roles is the position they start on in the field after each goal and the fact that goalies have a bigger hitbox.

When convenient, you can use `SimpleEngine`, which only asks that you implement an `engine_func` method and implements all the `Engine` methods for you.

This can be useful when you will use exactly the same code for *all* your roles. You just write that code once in `engine_func` and `SimpleEngine` uses it for `goalie`, `left`, `right`, etc. indistinctly.

```
pub trait SimpleEngine {
    fn engine_func(&mut self, engine_data: EngineData) -> EngineTransition;
}
```

This is what we’ll be using. But you’ll likely want to have different methods for your different roles if you implement something more complex.

We’ll create a file `src/engines/myengine.rs`. That’s where our entire engine will live.

We’ll first create the silliest engine possible: we’ll always stay put and not do any actions. Later we’ll make a less silly engine.

```
// src/engines/myengine.rs
//
// First import everything we need.
// "crate" are imports relative to our own code.
// So crate::components is under src/components, for example.
use crate::{
    components::player::ActionType,
    engines::{EngineData, EngineTransition, SimpleEngine},
};
// `math` here is amethyst's re-export of nalgebra
// (assumed import, for `Vector2`).
use amethyst::core::math;

// We don't need any fields. We just define an empty struct.
pub struct MyEngine;

// We use `impl` to implement the `SimpleEngine` trait.
impl SimpleEngine for MyEngine {
    fn engine_func(&mut self, engine_data: EngineData) -> EngineTransition {
        // In Rust, the last expression in the function is the return value!
        // Here, we return an EngineTransition, as the trait defines.
        // Remember not to add a semicolon (;) for the return expression
        // as that evaluates to an empty value ()!
        EngineTransition {
            // Stay put! Moving is dangerous!!!! (´･_･`)
            velocity: math::Vector2::new(0.0, 0.0),
            // Don't act! Actions have consequences!!!!! (´･_･`)
            action: None,
        }
    }
}
```

Now we need to register our new engine in the `EngineRegistry`. For that, we’ll need to export our `MyEngine` struct and import it in the registry.

We export our new file using `pub mod` in the module’s `mod.rs` file.

```
// src/engines/mod.rs
pub use self::engine::{Engine, EngineData, EngineTransition, SimpleEngine};
pub mod basic;
pub mod engine;
pub mod myengine;
```

Now we add it to the registry in `src/resources/engine_registry.rs`. The first step is importing it.

```
// src/resources/engine_registry.rs
use crate::engines::{
    basic::{Basic, BasicWingWait},
    myengine::MyEngine,
    Engine,
};
use std::collections::BTreeMap;
```

And then, later down the same file, we add an `insert` in the definition of `EngineRegistry::default`.

```
// src/resources/engine_registry.rs, line 41
impl Default for EngineRegistry {
    fn default() -> Self {
        let mut registry =
            EngineRegistry::new(BTreeMap::<String, Box<dyn Engine + Send + Sync>>::new());
        registry.insert("basic".to_string(), Box::new(Basic::new()));
        registry.insert(
            "basic_wing_wait".to_string(),
            Box::new(BasicWingWait::new()),
        );
        registry.insert("myengine".to_string(), Box::new(MyEngine {}));
        registry
    }
}
```

Here we added an entry to a map from a `String` to a `Box<dyn Engine>`.^{5,6}

Two things require explanation for people less familiar with Rust: `Box` and `to_string`. If you already know about them, skip to the next section.

We need to use a `Box` around our Engine due to the way Rust handles dynamic dispatch to structs. `Box` allocates our struct on the heap and allows us to handle structs of different sizes in the same data structure.

The reason why we need to call the `to_string` method on `"myengine"` is that string literals in Rust are not really of type `String` but rather of type `&str`. This is an optimization that avoids heap-allocating literals, but since they’re of a different type, we have to call `to_string()` on them if we *need* a heap-allocated `String`.

Currently, the best way to use a different engine is to change the configuration in `assets/player.ron`. This will change the engine used by the upper-side team. The lower-side team is governed by the configuration in `assets/enemy.ron`.

I’ll add a command-line parameter for this in the future.

For now, what you need to do is change the value in `EngineRunner` to `"myengine"`.

```
// assets/player.ron line 123
extras: PlayerData(
    player: (
        speed: 48.0,
        kick_strength: 256.0,
        push_strength: 96.0,
        side: UpperSide,
    ),
    robot: Robot(logic_module: EngineRunner("myengine")),
),
```

All that’s left to do is to run `cargo run` again. You may need to do `source ./env.sh` again if you’re running Linux and you closed your first terminal.

You should see the upper-side players stay put while the lower-side team does its best to score.

Okay. That wasn’t much. Let’s go ahead and write an engine that at least *tries* to play the game. We’ll modify `MyEngine` directly so that we don’t need to worry about registering another engine.

Here’s the code:

```
// We use `impl` to implement the `SimpleEngine` trait.
impl SimpleEngine for MyEngine {
    fn engine_func(&mut self, engine_data: EngineData) -> EngineTransition {
        // The idea behind this engine is simple:
        // 1. If we're close to the ball, run towards it and kick it!
        // 2. If we're far away from the ball, run to your own net! Defense!
        //
        // First compute the difference in position with the ball.
        let difference_with_ball = engine_data.ball_position - engine_data.own_position;
        // Vector2 from nalgebra implements `norm()`, which computes the euclidean norm.
        if difference_with_ball.norm() > 100.0 {
            // We're far away from the ball. Find which direction your net is in!
            let difference_with_own_net = engine_data.own_net_position - engine_data.own_position;
            // Now run in that direction as fast as you can!
            EngineTransition {
                velocity: difference_with_own_net * engine_data.own.speed,
                action: None,
            }
        } else {
            // We're close to the ball!
            // ATTAAAAACKKKK!!!!!
            EngineTransition {
                velocity: difference_with_ball * engine_data.own.speed,
                action: Some(ActionType::Kick),
            }
        }
    }
}
```

I didn’t explain what `engine_data.own` was before. It just holds a bunch of constant data about the player: their speed, how hard they can kick the ball, etc.

For those of you unfamiliar with Rust, the reason this method works and returns the right value from inside the `if ... else ...` is that `if-else` itself is an expression: it evaluates either to *the last expression* of the *if* block when the condition is `true`, or to *the last expression* of the *else* block when the condition evaluates to `false`. In other words, `if-else` behaves like the ternary operator `A ? B : C` does in other imperative languages.

If you save those changes and run `cargo run` you should see a bunch of players desperately running toward their own goal and kicking the ball away as hard as they can when it gets close to them.

The code for MyEngine, including the changes to the registry and the configuration, can be found in the branch `blogpost-1-implemented` of the GitHub repo for FBSim.

If you want to keep improving `MyEngine`, with a little bit of work you should be able to beat `basic` and `basic_wing_wait` without having to do anything fancy. You can look at their implementation under `src/engines/basic.rs` for inspiration. Remember to change `assets/enemy.ron` to use `basic_wing_wait` instead of `basic` if you want to play against it. Think of `basic` as level 1 and `basic_wing_wait` as level 2.

This was a fun project during my two-week vacation. Nothing serious. If I keep working on it, this is what I’d like to do.^{7}

- Adding support for writing engines in scripting languages. I was thinking of Python and Lua. I might try to throw in Mun just for fun.
- Training Reinforcement Learning agents for FBSim. I might do it in Rust directly or wait until I implement Python scripting, depending on the state of the Rust ecosystem for ML/RL.
- Adding support for self-managed tournaments between a group of engines.
- Maybe more game-like features to lower the entry barrier?

Big thanks to my friends Elías Masquil and Joaquín Alori who shared very helpful comments about earlier versions of this blogpost and FBSim.

- I *do* know “FBSim” is the worst name ever. Don’t judge me.
- Raise your hand if the mention of a goalie was the first time you realised I was talking about soccer. If you need to localize the game to North America, I think changing the sprites and the background colour to make it look like ice hockey shouldn’t be too difficult.
- When using amethyst, you should probably not import nalgebra directly, as you can get most symbols by importing `amethyst::core::math`, which re-exports a lot of stuff from nalgebra.
- If you’ve never heard of an `Option` type, it’s a type that encodes a value that may not exist, similar to how you’d use `None` in python, sentinel values in `C`, and other contraptions in other languages. The difference is you need to also mark the case where the value *does* exist as `Some(value)`, and not just `value`. This way, the compiler can check whether you’re making the mistake of treating a value that *may not* exist as if it *always* existed. It’s a cool little idea that’s most common in functional programming languages, but has been getting popular with imperative languages as well recently.
- It’s actually `Box<dyn Engine + Sync + Send>` but let’s not go into unnecessary detail.
- You can ignore the `dyn` if you don’t know what that is. It’s just telling Rust that we don’t know at compile time the exact type of the objects in the boxes: we only know about a trait they implement.
- Not counting the small changes like adding a command-line argument for setting which engine to use, etc.

His proof is very simple, so much so that I think someone with more talent for educating than me could probably teach it to middle school students. I think that’s really cool.

Formally, the sum we’re talking about is the sum of 1/n for all positive integers n. What d’Oresme did was prove that for any number M, you can find a finite N so that the sum of the first N terms (i.e. 1/1 + 1/2 + … + 1/N) in the original sum is larger than M. If the first N terms sum to more than M, then the entire sum must be larger than M as well (because there are no negative numbers in the sum). That means the sum is larger than *any* finite number and, therefore, it has to be infinite.

But how do you find the right N that will make your sum large enough? Well, d’Oresme used the following fact: if you take all the numbers 1/n with 2^k < n ≤ 2^(k+1), then you have 2^k numbers that are all at least 1/2^(k+1). The sum of all of them must therefore be at least 2^k · 1/2^(k+1) = 1/2.

The above proves that the first such number (i.e. 1/2, with k = 0) has to be at least 1/2. From 1/3 to 1/4 (i.e. k = 1) the sum is also larger than 1/2. The numbers from 1/5 to 1/8 also sum to more than 1/2. In other words, the numbers from any power of 2 to the next power of 2 sum to over 1/2. It’s easy to see from this that the first 2^k numbers in the series sum to something larger than k/2.

In other words, an N such that the sum of the first N terms is higher than M is N = 2^(2M). And that, in a nutshell, is Nicole d’Oresme’s proof of the divergence of the harmonic series.
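d’Oresme’s argument gives a concrete bound — the first 2^k terms sum to more than k/2 — which is easy to check numerically. A quick sanity check in Python (not part of the original proof, just for illustration):

```python
def harmonic_sum(n):
    """Sum of 1/k for k = 1..n."""
    return sum(1.0 / k for k in range(1, n + 1))

# The first 2**k terms always sum to more than k/2...
for k in range(1, 15):
    assert harmonic_sum(2 ** k) > k / 2

# ...so to push the partial sum past any target M, N = 2**(2*M) terms suffice.
M = 3
assert harmonic_sum(2 ** (2 * M)) > M
```

The partial sums grow logarithmically, which is why N has to grow exponentially in M — slow, but unbounded all the same.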

Nicole d’Oresme studied a bunch of cool topics. I think he, as pretty much all other medieval scientists, is an under-appreciated historical figure. I like reading about medieval scientists and their works, so I might post other interesting things from them in the future.

Do bear in mind that I’m not a historian, so any historical commentary I include in my posts about medieval scientists should be taken with a grain of salt.

If you use `setuptools`’s `pkg_resources` for handling data files in your package, then add `__init__.py` files to all your internal sub-directories. Use `__init__.py` files everywhere as if you were using python2, because not having them also creates issues with some linters and other tools.
`__init__.py` files used to be obligatory for sub-packages up until PEP-420 –yes, 420– which was implemented in python 3.3. Now `python` itself can handle code in any subdirectory as a module without *needing* an `__init__.py` file everywhere.

That said, certain libraries and tools don’t handle modules without `__init__.py` files correctly. In particular, `pkg_resources` behaves weirdly when you try to access data in a module that doesn’t have an `__init__.py` file.

Suppose you have the directory structure below. In this situation, if you run `resource_exists("mod1.mod2", "some_data_file.txt")` –where `resource_exists` is imported from `pkg_resources`– while standing on the root of the directory tree below, it will raise a `NotImplementedError`, which is not the most helpful of messages.

```
.
└── mod1
    ├── __init__.py
    └── mod2
        └── some_data_file.txt
```

This is if you’re *not* working on an installed package. If you actually install it, you may get a weirder `expecting str, bytes or os.Pathlike, not None` kind of error. I’m not sure when you get one or the other, but the point is: if there isn’t an `__init__.py` file, there’s a good chance you won’t get what you’re looking for.

To fix this you just have to create an `__init__.py` file, ending up with the following directory structure; then `resource_exists("mod1.mod2", "some_data_file.txt")` just returns `True`.

```
.
└── mod1
    ├── __init__.py
    └── mod2
        ├── __init__.py
        └── some_data_file.txt
```

The case of `pkg_resources.resource_filename` is even more fun. Check this out.

When run on a non-installed package, if you don’t have the inner `__init__.py`, `resource_filename("mod1.mod2", "some_data_file.txt")` will return `"some_data_file.txt"`. So, just your second argument. Not even `"mod1/mod2/some_data_file.txt"`. Just the final filename. Why it does this, I’m not sure.

If you do have the `__init__.py` then it will return `"/absolute/path/to/your/project/mod1/mod2/some_data_file.txt"`. This is the normal behaviour of this function. You usually want to `open` the output of `resource_filename`, so the path must be complete (although it doesn’t need to be absolute).

Note that if you actually pass it a module that doesn’t exist (as opposed to a module that exists but doesn’t have an `__init__.py` file), it will raise a `ModuleNotFoundError`.

This means something like the following function works even without an `__init__.py`. Of course, I don’t need to tell you **not** to use a function like this one, as it is the dirtiest hack around and probably has its own set of bugs and edge-cases, but I thought it was funny that something like this worked when I tested it.

```
import importlib
import os
import sys

import pkg_resources


def resource_filename(module_path, resource_path):
    pkg_resources_path = pkg_resources.resource_filename(module_path, resource_path)
    if "/" not in pkg_resources_path:
        mod = sys.modules[module_path]
        if isinstance(mod.__path__, list):
            module_fs_path = mod.__path__[0]
        else:
            # Hoping this behaves like a _NamespacePath.
            module_fs_path = mod.__path__._path[0]
        # Create the missing __init__.py, reload, and try again.
        os.mknod(os.path.join(module_fs_path, "__init__.py"))
        importlib.reload(mod)
        pkg_resources_path = pkg_resources.resource_filename(module_path, resource_path)
    return pkg_resources_path
```

On a different note, as I was writing this I also noticed that the `__path__` attribute of a module is a `List[str]` if the module has an `__init__.py`, and a `_NamespacePath` –whose attribute `_path` is a `List[str]`– if the module doesn’t. That’s fine, because you really shouldn’t expect consistent behaviour from `__*__` variables, but I thought it was a weird quirk. More generally, if you don’t have an `__init__.py`, your module is called a *namespace package* and looks different from a “normal” module in several ways.

The lesson is that PEP-420 did not save us from `__init__.py` files. Just use them everywhere and save yourself future problems.

Typical scenarios where AP might come in handy include:

- You’re classifying but the threshold is going to vary depending on external variables, so you can’t use normal thresholded metrics.
- The classifier model is to be used both as a classifier and a ranker in different applications.
- You’re building a pure classifier, but you’re interested in understanding how your classifier ranks results by its scores, because it could help you better understand where it’s going wrong, etc.

There’s one important property of AP which in my experience is rarely mentioned but incredibly important when used for classification problems:

*Average Precision depends heavily on positive rate (i.e. class priors).*

To some people this might seem obvious. *Precision* is famously skewed by positive rate, so surely *Average Precision* must be too! Well, that’s not a great argument, although it reaches the right conclusion.

When I ran into this for the first time, I actually managed to convince myself of the opposite: that AP was independent of the positive rate. That’s wrong. This is how I tricked myself: “AP is the area under the precision-recall curve. If positive rate is high, then it will be easy to have high *precision*, but it will be hard to catch all the positive items (i.e. have high recall). If positive rate is low, then *precision* will be hard to get, but *recall* will be easier. Probably the changes in *precision* and *recall* cancel each other out to some degree.”

It took me a couple of minutes to realize the mistake in the above reasoning.^{1} Sloppy half-reasoning like that is often useful but dangerous. It’s important to always double-check, either by consulting the literature or by trying to find a more rigorous argument or proof.

The problem was this: *recall* does not vary with positive rate, unlike what I implied. Of course, if the positive rate is higher, there are going to be more items that you won’t catch, but the ratio of caught items is expected to stay the same.

Let’s visualize this with an example. In the following image we’ll position blue and orange dots on a line. Each dot represents an item in the test set. The farther to the right the item is, the higher the *predicted score* for that item. Blue dots are *positive examples*, while orange dots are *negative examples*. Let’s look at what happens with *precision* (P) and *recall* (R).

Now we’ll look at the same exact example, but we’ll undersample from the positive class. Note that *the model giving the scores is exactly the same*, all that’s changing is that we’re getting rid of some of the positive examples in the test set.

Notice how when we undersample from the positive class, *precision can only go down.* There is no positive example we can eliminate that will make *precision* go up. *Recall*, on the other hand, may go up or down depending on which side of the threshold we eliminate items from, but the expected result is that it will stay the same. More on that later.

That example illustrates how when we reduce the positive rate in a test set, *precision* is expected to go down and *recall* is expected to stay the same. Of course, a single example with a small dataset doesn’t prove anything, but hopefully it helps visualize why *precision* and *recall* behave differently when we change the positive rate.
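The same behaviour shows up in a quick simulation. Everything below is synthetic and just for illustration: scores are drawn from two overlapping Beta distributions so that the “model” is weakly informative, and we then undersample the positive class:

```python
import random

def precision_recall(examples, threshold):
    """Precision and recall at a fixed threshold.

    `examples` is a list of (label, score) pairs, label 1 = positive.
    """
    tp = sum(1 for y, s in examples if y == 1 and s >= threshold)
    fp = sum(1 for y, s in examples if y == 0 and s >= threshold)
    fn = sum(1 for y, s in examples if y == 1 and s < threshold)
    return tp / (tp + fp), tp / (tp + fn)

random.seed(1)
# A weakly informative model: positives tend to score higher than negatives.
positives = [(1, random.betavariate(4, 2)) for _ in range(5000)]
negatives = [(0, random.betavariate(2, 4)) for _ in range(5000)]

p_before, r_before = precision_recall(positives + negatives, 0.5)

# Undersample the positive class to 10% of its original size.
undersampled = random.sample(positives, 500)
p_after, r_after = precision_recall(undersampled + negatives, 0.5)

# Precision drops a lot; recall stays roughly the same.
print(p_before, r_before)
print(p_after, r_after)
```

The exact numbers depend on the distributions and the seed, but the pattern is robust: the false-positive count is untouched by the undersampling while the true-positive count shrinks, so precision falls; recall is a ratio computed only over positives, so it barely moves.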

Another way to visualize how positive rate won’t affect your *recall* is to imagine having a dataset and then *undersampling the negative class*. When you undersample the negative class the positive rate will go up, but *recall* won’t change, as the formula for *recall* (`TruePositives / AllPositives`) simply ignores all negative examples. In other words, *recall* depends only on the distribution of scores for items of the positive class, while *precision* depends on the distribution of scores of both the positive and the negative class, and also on the class priors.^{2}

What we’re seeing then is that for a particular threshold, lowering the positive rate is expected to lower *precision* and maintain *recall*.^{3} In consequence, all values of the *precision-recall curve* are expected to go down, and therefore, AP –being the area under the *precision-recall curve*— is expected to go down as well.^{4}

When could this become a problem? There are two big cases.

- Comparing the performance of models tested with different data.
- Comparing the performance of a model in different subsets of a test set.

Suppose the following case: your classifier is retrained periodically with new data, and you want to know whether the model got better or worse. It’s dangerous to compare APs for different test sets because if the positive rate changed from one to the other, then you may have a big swing in AP that’s not really related to how good the model is!

One option to mitigate this could be to test all models with the same dataset, or to create carefully crafted stratified test sets that keep these properties constant artificially.^{5} Another option is to use AUC-ROC or other metrics that don’t have this problem instead of AP. But obviously, they come with their own caveats.

The other case is when you want to check for which types of items your model performs well or badly. If you simply compute AP for different subsets of your test set, you need to be very careful that the positive rates in those different subsets stay the same! If not, AP might give you a completely wrong idea.

To make it easier to visualize: if the positive rate is 50%, *a completely random model is expected to have an AP of 0.5*. On the other hand, if the positive rate is 2%, AP of a random model will be 0.02. Getting an AP of 0.5 in that case might be incredibly hard.^{6}
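That claim is easy to sanity-check with a tiny simulation: give every item a random score (a completely uninformative model) and compute AP at two different positive rates. The AP implementation below is the usual “mean of precision at the rank of each positive” formulation:

```python
import random

def average_precision(examples):
    """AP as the mean of precision@k over the ranks k where a positive sits.

    `examples` is a list of (label, score) pairs, label 1 = positive.
    """
    ranked = sorted(examples, key=lambda e: -e[1])
    tp, precisions = 0, []
    for rank, (label, _score) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)

random.seed(0)
n = 20000

def random_model_examples(positive_rate):
    # Scores are independent of labels: a completely random model.
    return [(1 if random.random() < positive_rate else 0, random.random())
            for _ in range(n)]

ap_balanced = average_precision(random_model_examples(0.5))
ap_rare = average_precision(random_model_examples(0.02))
print(ap_balanced)  # lands near 0.5
print(ap_rare)      # lands near 0.02
```

A random model’s AP tracks the positive rate, so the “baseline” you should compare your AP against moves with the class priors.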

Next time you’re working with AP in a classification context, be wary of the effect positive rate has on it.

- Apart from the obvious fact that having two things pulling in opposite directions is not enough to say they cancel each other out.
- With threshold t, *recall* for a large enough test set will be P(s ≥ t | y = 1), the probability that a positive item scores above the threshold. That’s invariant to class priors and to the distribution of scores of negative class items in the test set.
- “Lowering the positive rate” is reducing the ratio of the test set that is of positive class. It’s assumed that the underlying class-conditional distributions stay the same. Crucially, though, if in real life you see your positive rate change significantly, this will likely be accompanied by a change in class-conditional distributions as well! Analysing such cases is complex. Understanding that AP reflects positive rate is just the first step in avoiding pitfalls when you have varying positive rates in test sets.
- There’s an important interaction that I decided to leave out of this blogpost for simplicity: if by undersampling we cause the *minimum score for positive class items* to change (e.g. if by undersampling the positive class, we get rid of the positive items with the lowest score) then AP could go up *even if precision at all thresholds goes down*! That’s because if the minimum score for ground-truth-positive items goes up, low thresholds (which usually have bad precision) stop being taken into consideration when computing AP. All of this is not too important, because for big-enough, dense-enough, natural datasets this effect should be much smaller than the one introduced by the change in precision at each threshold. Nevertheless, I found it interesting.
- Of course, doing this makes evaluation metrics harder to interpret, as they no longer mean simple things like “the probability of a classified-positive being ground-truth-positive”. Disentangling the meaning of an evaluation metric in a stratified test set requires some work. Also, doing this might penalize you too much for poor performance on very rare items.
- Interpreting some value of some evaluation metric as “good” or “bad”, “hard” or “easy” depends on a lot of different factors. Here, “hard” is just intended to mean “much better than random guessing”.

A lot of the work I’ve done for the past two months has revolved around one huge DynamoDB table and a few smaller tables complementing the huge one. I thought I’d share some of what I learned in the process.

I’m not going to be talking about the basics of DynamoDB (eventual consistency, avoiding scans and hot partitions, keeping items under 1KB, etc.). For the most basic advice, you should check out the DynamoDB docs.

Spoiler alert: the main takeaway from this post will be the following.

*“DynamoDB will not scale unless you know what you’re doing from the very beginning.”*

A lot of people have mentioned this already.^{1,2,3} Hot partitions are probably the biggest problem here, but I’m not going to talk about that (yet).

Queries return up to 1MB of data (and this is *before* any filtering is done).

Now, imagine you’re building some kind of social network where people post photos. You’ll save the following data.

- Entity: *Photo*
  - `UserId`
  - `Timestamp` – moment when the photo was added to the database.
  - `Thumbnail`
  - `PhotoLink` – something to help you retrieve the actual photo stored in S3.
  - `Likes` – number of likes/shares/upvotes/whatever
  - `Title`
  - `Description`

And your most important queries will be to get the last few photos by a specific user.

Now, if you don’t know what you’re doing and decide to believe in the allegedly seamless scalability and ease of use of DynamoDB, you may consider having a single DynamoDB table with `UserId` as the partition key and `Timestamp` as the sort key, and having everything else as attributes. That way you can make a Query with `Limit=K` and `ScanIndexForward=false` to get the last `K` photos of a user.
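For reference, that query would look roughly like the following with the low-level DynamoDB client (the table name, user id and `K` below are made up for the example):

```python
# Parameters for a "last K photos of a user" query, as a plain dict.
query_params = {
    "TableName": "Photos",
    "KeyConditionExpression": "UserId = :uid",
    "ExpressionAttributeValues": {":uid": {"S": "user-123"}},
    "ScanIndexForward": False,  # walk the sort key (Timestamp) in descending order
    "Limit": 10,                # K = 10: only the 10 most recent photos
}

# With boto3 this would be something like:
# response = boto3.client("dynamodb").query(**query_params)
# photos = response["Items"]
```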

Of course, DynamoDB will overwrite any items with the same `UserId` and `Timestamp`, which is something you probably don’t want to happen. But you can fix that with a fixed-size random postfix to the `Timestamp` or something of the sort. Let’s ignore that in the name of simplicity.

Now suppose you also want to add a feature where users can see their most-liked photos. This is where the 1MB limit on the query return size kicks in. You could implement it by querying for all photos by a user and sorting them by `Likes` in-memory. But if the number of photos by a user grows so that you have more than a few MB of data for a single user, you’ll have to get that data in several sequential requests with 1MB responses each, instead of a single query, and that will degrade the user experience significantly.

In the case I’ve been working on, there was no noticeable difference between getting 0.1MB of data (i.e. setting a pretty low `Limit`) or 0.9MB of data in a query, but when we went over 1MB and had to do 2 sequential requests, the time pretty much doubled.
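When a response does hit the 1MB limit, DynamoDB hands back a `LastEvaluatedKey`, and you have to keep issuing requests with `ExclusiveStartKey` until it stops. Here’s a sketch of that loop with a stand-in `query` function (in real code that would be a boto3 `table.query` call):

```python
def query_all(query, **params):
    """Keep issuing requests until DynamoDB stops returning LastEvaluatedKey."""
    items, start_key = [], None
    while True:
        if start_key is not None:
            params["ExclusiveStartKey"] = start_key
        page = query(**params)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            return items

# A fake query that serves 5 items in pages of 2, to demonstrate the loop.
def fake_query(**params):
    start = params.get("ExclusiveStartKey", 0)
    result = {"Items": list(range(start, min(start + 2, 5)))}
    if start + 2 < 5:
        result["LastEvaluatedKey"] = start + 2
    return result

print(query_all(fake_query))  # [0, 1, 2, 3, 4]
```

Each iteration of that loop is a separate round trip, which is exactly why crossing the 1MB boundary roughly doubled the latency in the case above.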

To do that query efficiently you’ll have to add a Global Secondary Index with `UserId` as the partition key and `Likes` as the sort key, which essentially *doubles* the cost of your table.

So why not add a local secondary index instead? First of all, that won’t save you a lot of money if your expenses come mostly from write capacity. But there’s something worse.

If you’re not very aware of this restriction, you could end up in a world of pain.

When you have a local secondary index, you can’t have a collection of items with the same value in the partition key exceeding 10GB in total. In other words, applied to this case, you can’t have more than 10GB of data for a single user (i.e. a collection of items with the same value in `UserId`).

In this case, it’s fairly unlikely normal users will ever reach that limit, but a bot posting every 30 seconds could get there quite fast (depending on the item sizes). And in other applications the limit could be very much within reach.

You could have thought that the entity we were trying to model (Photo) required simple enough queries that it had to be trivial to model with DynamoDB, but you would’ve thought wrong. If you need to make queries that are at all more complex than just retrieving a single entry given a key, you should not take the decision of using DynamoDB lightly.
