Bayes' Theorem and the Problem of Induction

I’ve seen Bayes' theorem proposed as a solution to the Problem of Induction. I don’t think that’s correct. Here I will show the intuition behind the theorem and its derivation, and explain why it’s not an induction method \(M()\) but is still useful for picking a hypothesis.

Conditional probability

Conditional probability is, as the name suggests, the probability of some event A given that some other event B has happened. We write it \( P(A|B) \) and read it “the probability of A given B”. The probability of A on its own is written \( P(A) \), and that of B as \( P(B) \).

Now consider the following fictitious contingency table for the probability of preferring chocolate or vanilla ice cream, broken down by sex. Note that someone can only prefer one flavor: you can like both, but only one is your favorite. Just like children.

Sex\Flavor   Chocolate   Vanilla   Total
Female       0.29        0.22      0.51
Male         0.31        0.18      0.49
Total        0.60        0.40      1.00

Those are the probabilities for four mutually exclusive groups:

  • Female who prefers chocolate 0.29;
  • Female who prefers vanilla 0.22;
  • Male who prefers chocolate 0.31;
  • Male who prefers vanilla 0.18.

The last row and the last column (the margins of the table) show the marginal probabilities for both flavor and sex. You can obtain the marginal probabilities by summing the probabilities in a given row or column:

  • The probability of being male is \(0.49 = 0.31 + 0.18\);
  • The probability of preferring vanilla is \(0.40 = 0.22 + 0.18\);
  • etc.

If we want to know the probability of someone preferring vanilla given that they are male, first we need to restrict the sample space to males, \( P(Males) = 0.49 \), then look up the joint probability of being male and preferring vanilla, \( P(Males \cap Vanilla) = 0.18 \). We have:

\[ \frac{ P(Males \cap Vanilla) }{ P(Males) } = \frac{0.18}{0.49} \approx 0.367 \]

We can read that as the probability of vanilla given males, or more generally the probability of A given B \( P(A|B) \) and we write:

\[ P(A|B) = \frac{ P(A \cap B) }{ P(B) } \]

For example the probability of preferring chocolate \( (A) \) given female \( (B) \) is:

\[ P(A|B) = \frac{ P(A \cap B) }{ P(B) } = \frac{0.29}{0.51} \approx 0.57 \]
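
To make the table lookups concrete, here is a minimal Python sketch (the dictionary and function names are mine, and the values are the hypothetical ones from the table above) that computes marginal and conditional probabilities from the joint probabilities:

# Joint probabilities from the (fictitious) ice cream contingency table above.
joint = {
    ("female", "chocolate"): 0.29,
    ("female", "vanilla"): 0.22,
    ("male", "chocolate"): 0.31,
    ("male", "vanilla"): 0.18,
}

def marginal_sex(sex):
    """P(sex): sum over all flavors for that sex (a margin of the table)."""
    return sum(p for (s, _flavor), p in joint.items() if s == sex)

def conditional(flavor, sex):
    """P(flavor | sex) = P(sex and flavor) / P(sex)."""
    return joint[(sex, flavor)] / marginal_sex(sex)

print(marginal_sex("male"))                # ~ 0.49
print(conditional("vanilla", "male"))      # ~ 0.367
print(conditional("chocolate", "female"))  # ~ 0.57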

Bayes' theorem

What if obtaining \( P(A \cap B) \) is not so straightforward? What if \( P(A \cap B) \) depends on something else? Here is an example shamelessly copied from Wikipedia. Suppose a cannabis test with the following properties:

  • 90% sensitive, meaning the true positive rate (TPR) = 0.90;
  • 80% specific, meaning the true negative rate (TNR) = 0.80: the test correctly returns a negative result for 80% of non-users, but also yields 20% false positives among non-users, i.e. the false positive rate (FPR) = 0.20;
  • 0.05 prevalence, meaning only 5% of the population actually uses cannabis (this is where the problem arises).

Here are the test’s conditional probabilities and the prevalence in table form:

Actual\Test   Positive   Negative
User          0.90       0.10
Non-user      0.20       0.80

Prevalence:

User       Non-user
0.05       0.95

What is the probability of someone being a user given that they tested positive? That is, what is the probability of user given positive? If we apply the conditional probability formula we get:

\[ P(User|Positive) = \frac{ P(User \cap Positive) }{ P(Positive) } \hspace{1em} (1) \]

Hmm… ok? What now? We weren’t given \( P(User \cap Positive) \) or \( P(Positive) \); we need to obtain them somehow.

We can try to read \( P(User \cap Positive) \) off the table: the value at the intersection of User and Positive is 0.90, but that alone doesn’t suffice, because it is \( P(Positive|User) \), the probability of testing positive given that someone is already a user. We want the probability of user given positive for any given person in the population, so we also need to account for the probability of any given person being a user, 0.05. We have:

\[ P(User \cap Positive) = P(Positive|User)P(User) = 0.90 \times 0.05 = 0.045 \hspace{1em} (2) \]

In the Bayesian interpretation of probability that 0.05 is called the a priori probability: something we take as given and use to compute the a posteriori probability. This is going to be important later.

We could also have obtained this expression by manipulating the conditional probability formula:

\[ P(A \cap B) = P(B|A)P(A) \]

Now to \( P(Positive) \). You can intuitively think of it as the total number of positives: the probability of being positive given user, times the user population, plus the probability of being positive given non-user, times the non-user population. Except \( P(Positive) \), \( P(User) \) and \( P(\text{Non-user}) \) are probabilities and not populations.
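
Written out, that is just the law of total probability:

\[ P(Positive) = P(Positive|User)P(User) + P(Positive|\text{Non-user})P(\text{Non-user}) \]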

Then we have:

\[ P(Positive) = 0.90 \times 0.05 + 0.20 \times 0.95 = 0.235 \hspace{1em} (3) \]

Substituting \((2)\) and \((3)\) in \((1)\):

\[ P(User|Positive) = \frac{ P(Positive|User)P(User) }{ P(Positive) } = \frac{0.045}{0.235} \approx 0.1915 \]

That is, even if someone tests positive, there is only a 19.15% chance that they are actually a user. That’s because only a small percentage of the population, 5%, are cannabis users and our test is kinda bad.

The expression we obtained by substituting \((2)\) into \((1)\) is just Bayes' theorem, with User instead of \(A\) and Positive instead of \(B\):

\[ P(A|B) = \frac{ P(B|A)P(A) }{ P(B) } \]
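
As a quick sanity check, here is a minimal Python sketch of the same computation (the function and variable names are mine, not from any library):

# Numbers from the cannabis test example above.
sensitivity = 0.90  # P(Positive | User), the true positive rate
specificity = 0.80  # P(Negative | Non-user), so the false positive rate is 0.20
prevalence = 0.05   # P(User), the a priori probability

def bayes(prior, likelihood, false_positive_rate):
    """P(H|E) = P(E|H) P(H) / P(E), with P(E) from the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

print(bayes(prevalence, sensitivity, 1 - specificity))  # ~ 0.1915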

The Bayesian interpretation of probability

Let’s restate Bayes' theorem, changing the letters to \(H\) for hypothesis and \(E\) for evidence:

\[ P(H|E) = \frac{ P(E|H) }{ P(E) }P(H) \]

Bayesians interpret that as a relationship between the probability of the hypothesis before we obtain the evidence, \(P(H)\), and the probability of the hypothesis once we have the evidence, \(P(H|E)\).

That is, Bayes' theorem is a method to update the probability of a hypothesis given new evidence. \(P(H)\) is also known as the a priori probability and \(P(H|E)\) as the a posteriori probability.

In the cannabis test example above, the a priori probability was the prevalence, \(P(H) = P(User) = 0.05\); the evidence was the positive test result, with \(P(E) = P(Positive) = 0.235\); and the sensitivity gave us \(P(E|H) = 0.90\) (TPR). Updating produced the a posteriori probability \(P(H|E) \approx 0.19\).
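
To make the updating reading concrete, here is a hypothetical sketch: it assumes we could run the same test twice on the same person with independent results (that independence assumption is mine, not part of the example above) and feeds the posterior back in as the new prior.

# Hypothetical: two independent positive cannabis tests, updating after each one.
def update(prior, tpr=0.90, fpr=0.20):
    """Posterior P(User | Positive) for a given prior P(User)."""
    return tpr * prior / (tpr * prior + fpr * (1 - prior))

after_one = update(0.05)       # ~ 0.19, the number we computed above
after_two = update(after_one)  # ~ 0.52 after a second positive test
print(after_one, after_two)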

Bayesian probability is not a solution to the Problem of Induction

Bayesian probability only allows us to update the probability of a hypothesis \(P(H)\) given new evidence \(P(E)\) and for that it’s very useful, but it says nothing about why \(P(H)\) or \(P(E)\) should be correct everywhere (why they should be valid inductions). A few things could go wrong:

  • Procedural problems in the collection of the data that led to those probabilities;
  • Wrong modeling of the phenomenon: a very small sample could look like a uniform distribution while the population is in fact normal; the sample could look normal but the population could be skewed;
  • The most daring proposition: the axioms currently used to derive the rules of algebra and calculus might not represent relationships in nature.

Bayesian probability and falsification

I came across this blog post, written in reply to this other, now offline, blog post, which in turn is a mirror of a part of another blog post (search “So why is it that some people are so excited about Bayes’ Theorem?”), which was deprecated and now redirects to a post on the very interesting platform arbital.

I was going to assume the part I had a problem with had been retracted and leave it at that, but then I found this other post posing a similar idea. I’m not sure if it was written by the same person (I still don’t fully understand the arbital platform), but let’s take a look at it anyways. This is all from the “Possibility of permanent confirmation” section; the rest of the article is interesting and very didactic:

    It’s worth noting that although Newton’s theory of gravitation was false, something very much like it was true. So while the belief “Planets move exactly like Newton says” could only be provisionally accepted and was eventually overturned, the belief, “All the kind of planets we’ve seen so far, in the kind of situations we’ve seen so far, move pretty much like Newtonian gravity says” was much more strongly confirmed.

The problem is that a scientific theory should have predictive value. Newton’s theory could have worked until now and still fail tomorrow, because, say, the planets in our solar system turn their plane of rotation 90° every million years and we had no idea. Having perfect predictive knowledge of the past means nothing, as the author pointed out in their article.

In mathematics we can have perfect knowledge of the domain of a problem and thus claim a theorem always works; nature doesn’t allow us such luxury.

    This implies that, contra Popper’s rejection of the very notion of confirmation, some theories can be finally confirmed, beyond all reasonable doubt. E.g., the DNA theory of biological reproduction. No matter what we wonder about quarks, there’s no plausible way we could be wrong about the existence of molecules, or about there being a double helix molecule that encodes genetic information. It’s reasonable to say that the theory of DNA has been forever confirmed beyond a reasonable doubt, and will never go on the trash-heap of science no matter what new observations may come.

    This is possible because DNA is a non-fundamental theory, given in terms like “molecules” and “atoms” rather than quarks. Even if quarks aren’t exactly what we think, there will be something enough like quarks to underlie the objects we call protons and neutrons and the existence of atoms and molecules above that, which means the objects we call DNA will still be there in the new theory. In other words, the biological theory of DNA has a “something sort of like this must be true” theory underneath it. The hypothesis that what Joseph Black called ‘fixed air’ and we call ‘carbon dioxide’, is in fact made up of one carbon atom and two oxygen atoms, has been permanently confirmed in a way that Newtonian gravity was not permanently confirmed.

Same as above: the fact that it has worked so far doesn’t mean it will work tomorrow. Do I doubt the theory of DNA reproduction? Absolutely not, but that doesn’t mean there is any guarantee it will work tomorrow or everywhere. If we could guarantee such a thing, we would have a method \(M()\) that solves the Problem of Induction.