Probabilities cannot be obtained from data alone
Our mind loves certainty; in fact, many of our cognitive distortions can be traced back to an attempt by our mind to make things one-sided, clear, easy to manage. Black and white thinking, jumping to conclusions, and the ambiguity effect are all ways to save energy and make quick decisions. But what about making the right decisions?
For that we need to work with a model that actually reflects reality (as best as we can). Since reality is uncertain, we have to include that uncertainty somehow. The mathematical tool par excellence for doing that is probability, so let's start with that.
1. An unsettling question
The question goes like this: I throw a die and tell you I got a 6; what's the probability that I get another 6?
When the question looks easy there’s probably a trick (sometimes there isn’t). The standard probability textbook response would be 1/6, since the die has 6 numbers (1-6) and each is equally likely.
But where in the question did I say the die had 6 different numbers, or even 6 faces? It could be a die with a 6 on every face, so the probability would be 1; or perhaps it has 20 faces and it's 1/20. Actually, this is what real life is all about: it's not just that outcomes are uncertain, it's that the systems themselves are not fully known, adding to that uncertainty. We could call these first order uncertainty (uncertain outcomes) and second order uncertainty (an uncertain system).
Nassim Taleb coined the term ludic fallacy to describe our tendency to misuse games, which have no second order uncertainty, as models of real life situations. The problem is that second order uncertainty is precisely what makes real life so much harder than games.
2. Can’t set probabilities without prior beliefs
Now it’s time for Bayes’ theorem, which reads something like this:
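$$P(\text{belief} \mid \text{data}) = \frac{P(\text{data} \mid \text{belief}) \cdot P(\text{belief})}{P(\text{data})}$$

In words: your updated (posterior) belief is your prior belief reweighted by how well that belief explains the data.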
If you want to see the actual math worked out, you can read this post. What it's basically saying is that you cannot get probabilities from data (evidence) alone; you have to put in a little bit from your side. Data serves to update probabilities: it is a modifier on beliefs, not a straight setter of them¹.
Of course, enough data is able to modify beliefs so strongly that it does not matter where you started from. Let’s say I threw the die a billion times and got a billion 6s. It’d be pretty reasonable to say the die is just full of 6s and the probability of getting another 6 is practically 100%.
But what if I had just gotten four 6s in a row? The probability of that happening with a normal die is quite low: $(1/6)^4 = 1/1296 \approx 0.08\%$. That would make me quite suspicious, but I still would give it a reasonable chance that the die is a normal one.
After all, the huge majority of dice in the world are what we expect. Why? Cubes are among the simplest multifaced 3D objects, they have 6 faces, and the point of dice is to create randomness, so it makes sense to have 6 different numbers, the most natural option being 1 to 6. Dice could be loaded, but that would require more sophisticated manufacturing and a purpose for doing it (e.g., to rig some kind of game).
So going back to the original question: for the chances of getting another 6 I would say something slightly higher than 1/6, and for the chances of getting something between 1-5, somewhat lower. I'd still give non-zero probabilities to figures like 20, although they'd be extremely small, less than 1% (and smaller still for stranger numbers like 998)¹. But this would not be the rigorous way to get these figures. The workflow would be:
- Based on your prior knowledge, assign a probability to each possible scenario².
- Based on the data received, update those probabilities; these are called posterior probabilities³.
Let’s make this more concrete by showing actual numbers.
As an example, here is a possible prior on the die outcomes before we have seen any data.
If you are interested in the details, it’s just a weighted average of a uniform distribution on 1-6 and a geometric distribution.
Notice the probability decays exponentially but never reaches zero. We need a fast-decaying probability so that we can spread probability among as many numbers as possible, as we don't want to close the door on any possibility (more on this in the Addendum).
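A minimal sketch of how such a prior could be built, assuming illustrative parameters (the mixture weight `w` and the geometric decay `p` below are my own guesses, not the post's exact figures):

```python
import numpy as np

def prior_next_outcome(k_max=30, w=0.9, p=0.5):
    """Prior P(next throw shows k): a weighted average of a uniform
    distribution on 1-6 and a geometric tail that decays fast but
    never reaches zero, so no outcome is ruled out entirely."""
    ks = np.arange(1, k_max + 1)
    uniform = np.where(ks <= 6, 1 / 6, 0.0)
    geometric = p * (1 - p) ** (ks - 1)   # P(k) = p(1-p)^(k-1), k = 1, 2, ...
    return w * uniform + (1 - w) * geometric

probs = prior_next_outcome()
print(probs[:8].round(4))  # mass concentrated on 1-6, tiny but non-zero beyond
```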
Now for the updates: below you can find two interactive plots, so you can see the mechanics of the process in action.
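Since interactive plots don't carry over to text, here is a minimal sketch of the mechanics they animate, using a toy hypothesis space over candidate dice (the three candidates and the prior weights are illustrative assumptions, not the post's exact numbers):

```python
import numpy as np

# Toy hypothesis space: which die am I facing?
candidates = {"fair d6": 1 / 6, "all 6s": 1.0, "fair d20": 1 / 20}
p_six = np.array(list(candidates.values()))   # P(throw a 6 | hypothesis)
posterior = np.array([0.989, 0.001, 0.010])   # illustrative prior over hypotheses

for _ in range(1):                 # observe one 6 (try 4 for four in a row)
    posterior *= p_six             # Bayes: posterior ∝ likelihood × prior
    posterior /= posterior.sum()   # renormalize

# Predictive probability of another 6, averaged over what we now believe
p_next_six = posterior @ p_six
print(dict(zip(candidates, posterior.round(4))), round(p_next_six, 4))
```

With one observed 6 this lands slightly above 1/6, matching the figure argued above; feeding it four 6s in a row makes the all-6s hypothesis start to dominate.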
You may have a different take on the figures you would give, and that's fine. Of course, the updates on probabilities need to follow Bayes' theorem to be rigorous, and your prior should be based on some reasonable expectation of the kind of die a person asking such questions would use. That's the key: priors are not unique, but they need to be reasonable, and stronger domain knowledge will make them more accurate.
3. So what’s the deal?
“Probabilities cannot be obtained from data alone” is something that can surprise many of us, because we are used to doing precisely that. Why discuss assumptions or prior beliefs? Why bias the data with our opinions?
The thing is, not being explicit about your assumptions does not mean that you are not assuming anything; it just means you are leaving your assumptions unexamined. Behind every probability there's an assumption, and when we try to look the other way, we tend to fall for a default that's frequently exposed to the ludic fallacy. That is, we tend to underestimate uncertainty, which sets us at a disadvantage.
The ludic fallacy is saying the chances of getting a 6 are 1/6 without hesitation, only to find out afterwards the die had 6s on every face. For a real example, the Global Financial Crisis is a good reference: in the run-up to 2008, many financial institutions had based their decisions on miscalibrated probabilities. You know what came next.
To be clear, there are good reasons why we often rely on default, unexamined assumptions: they simplify the problem, and they are generally not that far off. Just like our cognitive biases, they are there for a reason. It's just that when the stakes become higher and making the right decisions becomes critical, the longer and harder path of precision is the only one that leads to success.
Thanks to Alejandro Sánchez Alarcón for the valuable feedback.
Addendum - Zero and almost zero are worlds apart
A key feature of Bayes' rule is that probabilities are multiplicative. Multiplicatively, 0 is a very special number; much like fire, it turns anything into ashes. In more concrete terms, if your prior probability of an event is 0, no matter how much data you gather, it will keep being 0 forever.
Is this a flaw of Bayes' rule? Not really: if you assign 0 probability to an event, you are saying it's absolutely impossible for it to happen. So if it happens, your model is already broken, and it makes sense that a broken model does not update correctly. That's why it's so crucial to give a non-zero probability, however small, to every possible event, however remote.
This may be surprising, as almost zero and zero are not that different, and in fact, in many areas of our life we make no distinction between the two (and sensibly so). But with enough data, you can end up multiplying an almost-zero prior by a high enough number to result in a meaningful posterior probability; while with zero you'll be stuck forever.
For our previous example, consider that the die in question, instead of being a 6-sided one, is a 20-sided one. We assigned very tiny probabilities to the numbers 7-20; however, you can see how with enough throws they grow to the true value.
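A minimal sketch of that contrast, assuming a two-hypothesis toy (an all-6s die versus a fair d6, with $10^{-9}$ standing in for "almost zero"):

```python
def posterior_all6s(prior, n_sixes):
    """Posterior that the die is all 6s after n_sixes consecutive 6s,
    against the alternative 'fair d6' (a two-hypothesis toy)."""
    like_all6s = 1.0 ** n_sixes      # an all-6s die always shows a 6
    like_fair = (1 / 6) ** n_sixes   # a fair die shows a 6 with prob 1/6
    num = prior * like_all6s
    return num / (num + (1 - prior) * like_fair)

for prior in (0.0, 1e-9):
    print(prior, [round(posterior_all6s(prior, n), 6) for n in (0, 5, 15, 30)])
```

The zero prior stays at exactly 0 no matter how many 6s come in, while the almost-zero prior climbs to near certainty within a couple of dozen throws.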
Footnotes
1. You may wonder: why? What we are doing here, ultimately, is modelling the world; that's what we need to calculate probabilities, and there's no way data alone will give you a model of the world. Data only tells you what happened at a particular point in time and space, while a model is meant to tell you what can happen at any point in time and space. If you just want to use the data you have, you'll need to interpolate and extrapolate, and that involves making assumptions: it requires an opinion on how you think the world behaves in the vicinity of the datapoints you actually know.
2. Math allows us to do this assignment all in one go; there's no need to go one by one.
3. The name posterior can be confusing, as it relates to a particular dataset. If, after doing your updates and getting your posteriors, I gave you a whole new set of observations, you would treat your latest posteriors as priors (they are prior to this last set of observations) and update them according to the new data.