Correlation as conditional probability

Linking correlation to probability - The binary way


Is there a link between the linear correlation coefficient and conditional probability?

One day, after observing that the correlation between two zero-mean variables was 0.3, someone asked whether this meant that, if one variable was positive, the probability of the other one being positive was 0.3 too.

The short answer is no, but it’s not straightforward to explain why: is 0.3 simply not the right figure, or can correlation not be interpreted in such a way at all? The questioner had an interesting intuition; after all, if correlation measures the linear dependency between random variables, shouldn’t it connect, at some point, to conditional probabilities (i.e., to the probability of one variable matching the other’s variation)?

Binary transformation

The first thing to notice is that, as the question is posed, the interest is centered on binary outcomes (positive vs. negative), whereas we often measure correlation for continuous variables (as was in fact the case in the example above).

Hence, the first step in bridging the gap between correlation and probability, as framed in this case, is to make our continuous variables binary. We call $A$ and $B$ the events in which the original continuous variables $X$ and $Y$ are above their respective thresholds $\theta_x$ and $\theta_y$, and we define the indicator (binary) variables $I_A$ and $I_B$ as:

$$I_A = \mathbb{1}_{X>\theta_x}, \qquad I_B = \mathbb{1}_{Y>\theta_y}.$$

Of course in some cases our variables might be binary to begin with and we can ignore this step.
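
To make this concrete, here is a minimal sketch in Python (using NumPy and made-up data; the thresholds and the dependence between the variables are arbitrary choices for illustration) of the binarization step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up continuous, zero-mean variables with some linear dependence.
x = rng.normal(size=100_000)
y = 0.3 * x + np.sqrt(1 - 0.3**2) * rng.normal(size=100_000)

# Binarize: I_A = 1 when X exceeds theta_x, I_B = 1 when Y exceeds theta_y.
theta_x, theta_y = 0.0, 0.0
i_a = (x > theta_x).astype(int)
i_b = (y > theta_y).astype(int)
```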

Correlation and its constituents

Now let’s recall the definition of correlation, which is none other than the covariance (how two variables vary together) normalized by the product of the variables’ standard deviations, so as to have a scale-independent output, always in the $[-1, 1]$ interval.

$$\text{Corr}(I_A, I_B) = \frac{\text{Cov}(I_A, I_B)}{\sqrt{\text{Var}(I_A)\,\text{Var}(I_B)}}.$$

Since the goal of this investigation is to bridge correlation with probability, in the following lines we are going to express the variance and covariance in terms of the probabilities of $A$ and $B$.

Covariance

Let’s leave the variances for a moment and focus on covariance.

$$\text{Cov}(I_A, I_B) = \mathbb{E}[I_A I_B] - \mathbb{E}[I_A]\,\mathbb{E}[I_B].$$

Note that the product $I_A I_B$ will be 0 if either variable is 0, and 1 on the intersection $A \cap B$; hence its expected value is the probability of the intersection:

$$\mathbb{E}[I_A I_B] = P(A \cap B).$$

For the other expectations we simply have:

$$\mathbb{E}[I_A] = P(A), \quad \mathbb{E}[I_B] = P(B),$$

which we can now substitute in the original expression:

$$\text{Cov}(I_A, I_B) = P(A \cap B) - P(A)P(B).$$
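
As a quick sanity check, here is a sketch with simulated indicators (the dependence mechanism is arbitrary) showing that the covariance computed directly matches the probability-based expression:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# An arbitrary pair of dependent binary indicators.
i_a = rng.binomial(1, 0.4, size=n)
i_b = np.where(i_a == 1, rng.binomial(1, 0.7, size=n), rng.binomial(1, 0.5, size=n))

cov_direct = np.cov(i_a, i_b, ddof=0)[0, 1]               # Cov(I_A, I_B)
cov_probs = (i_a * i_b).mean() - i_a.mean() * i_b.mean()  # P(A ∩ B) - P(A) P(B), empirically

print(cov_direct, cov_probs)  # the two values coincide (up to floating-point error)
```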

We can also rewrite the intersection by conditioning on one of the variables, say $A$, which allows us to take $P(A)$ out as a common factor:

$$
\begin{aligned}
\text{Cov}(I_A, I_B) & = P(A \cap B) - P(A)P(B) \\
& = P(B|A)P(A) - P(A)P(B) \\
& = [P(B|A) - P(B)]\,P(A).
\end{aligned}
$$

Analogously, we can decompose $P(B)$ by conditioning on $A$: $P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)$, where the superscript $c$ denotes the complementary event, i.e., $P(A^c) = 1 - P(A)$.

$$
\begin{aligned}
\text{Cov}(I_A, I_B) & = [P(B|A) - P(B)]\,P(A) \\
& = [P(B|A) - P(B|A)P(A) - P(B|A^c)P(A^c)]\,P(A).
\end{aligned}
$$

We replace $P(A)$ in the second term inside the brackets by $1 - P(A^c)$ so as to make some cancellations:

$$
\begin{aligned}
\text{Cov}(I_A, I_B) & = [P(B|A) - P(B|A)[1 - P(A^c)] - P(B|A^c)P(A^c)]\,P(A) \\
& = [P(B|A) - P(B|A^c)]\,P(A^c)P(A).
\end{aligned}
$$
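
The same kind of empirical check works for this conditional form (same arbitrary simulated indicators as in the sketch above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

i_a = rng.binomial(1, 0.4, size=n)
i_b = np.where(i_a == 1, rng.binomial(1, 0.7, size=n), rng.binomial(1, 0.5, size=n))

p_a = i_a.mean()                     # P(A)
p_b_given_a = i_b[i_a == 1].mean()   # P(B|A)
p_b_given_ac = i_b[i_a == 0].mean()  # P(B|A^c)

lhs = np.cov(i_a, i_b, ddof=0)[0, 1]
rhs = (p_b_given_a - p_b_given_ac) * (1 - p_a) * p_a
print(lhs, rhs)  # both sides agree (up to floating-point error)
```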

This will suffice for the covariance; now let’s get back to the variances.

Variance

Variance is nothing but the covariance of a variable with itself:

$$
\begin{aligned}
\text{Var}(I_A) & = \text{Cov}(I_A, I_A) \\
& = \mathbb{E}[I_A^2] - \mathbb{E}[I_A]^2.
\end{aligned}
$$

Since $I_A$ can only take the values 0 or 1, squaring it makes no difference:

$$
\begin{aligned}
\text{Var}(I_A) & = \mathbb{E}[I_A] - \mathbb{E}[I_A]^2 \\
& = P(A) - P(A)^2 = P(A)[1 - P(A)] = P(A)P(A^c).
\end{aligned}
$$

Obviously the same applies to any other binary variable, like $I_B$.
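
Again, a quick numerical check of this identity (a sketch with an arbitrary event probability):

```python
import numpy as np

rng = np.random.default_rng(2)

i_a = rng.binomial(1, 0.4, size=100_000)  # indicator of an event with P(A) ≈ 0.4
p_a = i_a.mean()

print(i_a.var())        # Var(I_A), computed directly
print(p_a * (1 - p_a))  # P(A) P(A^c), the derived expression; both values coincide
```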

Bringing it all together

Now we have all the ingredients to obtain an expression for correlation linked to conditional probabilities in the way the original question envisaged. Let’s start by substituting our previous results into the definition of correlation:

$$
\begin{aligned}
\text{Corr}(I_A, I_B) & = \frac{\text{Cov}(I_A, I_B)}{\sqrt{\text{Var}(I_A)\,\text{Var}(I_B)}} \\
& = \frac{[P(B|A) - P(B|A^c)]\,P(A^c)P(A)}{\sqrt{P(A)P(A^c)P(B)P(B^c)}}.
\end{aligned}
$$

A little bit of rearrangement yields a cleaner expression:

$$\text{Corr}(I_A, I_B) = [P(B|A) - P(B|A^c)] \sqrt{\frac{P(A^c)P(A)}{P(B)P(B^c)}}.$$

So correlation is linked to the difference in the probability of $Y$ being positive (here we are assuming $\theta_x = \theta_y = 0$, as in the initial statement of the problem) when $X$ is positive versus when it is negative. Notice how negative correlation implies that positive values of $Y$ are less likely when $X$ is positive, as expected. The square root is always positive and can be understood as a scaling factor that ensures correlation stays within its expected boundaries.
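
To see the full expression at work, here is a small simulation (assuming, for illustration, normal data with population correlation 0.3 and zero thresholds); the correlation of the indicators computed directly matches the one rebuilt from the conditional probabilities:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Made-up continuous variables with population correlation 0.3.
x = rng.normal(size=n)
y = 0.3 * x + np.sqrt(1 - 0.3**2) * rng.normal(size=n)

i_a, i_b = (x > 0).astype(int), (y > 0).astype(int)

p_a, p_b = i_a.mean(), i_b.mean()
p_b_given_a = i_b[i_a == 1].mean()
p_b_given_ac = i_b[i_a == 0].mean()

corr_direct = np.corrcoef(i_a, i_b)[0, 1]
corr_from_probs = (p_b_given_a - p_b_given_ac) * np.sqrt(p_a * (1 - p_a) / (p_b * (1 - p_b)))

print(corr_direct, corr_from_probs)  # both ≈ 0.19, and equal to each other
```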

Easing interpretability

We can make one last simplifying assumption to get rid of the scaling factor, which slightly hinders interpretability. If we place the binarizing thresholds ($\theta_x$ and $\theta_y$) at the median (or, equivalently, at the mean when the variable is symmetric), we will have $P(A) = P(A^c) = P(B) = P(B^c) = 0.5$ and hence:

$$\text{Corr}(I_A, I_B) = P(B|A) - P(B|A^c).$$

Since $P(B) = P(B|A)P(A) + P(B|A^c)P(A^c) = 0.5$, it follows that $P(B|A^c) = 1 - P(B|A)$, and we can take it a little bit further:

$$\text{Corr}(I_A, I_B) = 2P(B|A) - 1.$$

When we are given a correlation and want to convert it to a conditional probability, we would use the following alternative arrangement:

$$P(B|A) = 0.5 + \frac{\text{Corr}(I_A, I_B)}{2}.$$

Since 0.5 is the marginal probability of $B$, the above expression means that knowing that the event $A$ has happened increases the probability of $B$ by half the correlation between the indicator variables.

We can run a handful of checks to see that everything seems right. As expected, when the correlation is 0 we have $P(B|A) = 0.5$: the two variables are independent, so knowing $A$ changes nothing. If it’s 1, $P(B|A) = 1$, as $A$ and $B$ effectively become the same event. Conversely, if it’s -1, $P(B|A) = 0$.
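
Empirically, under the median-threshold assumption, the relation is easy to reproduce (same kind of made-up normal data as before, with thresholds at the empirical medians):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

x = rng.normal(size=n)
y = 0.3 * x + np.sqrt(1 - 0.3**2) * rng.normal(size=n)

# Thresholds at the medians, so that P(A) = P(B) = 0.5.
i_a = (x > np.median(x)).astype(int)
i_b = (y > np.median(y)).astype(int)

p_b_given_a = i_b[i_a == 1].mean()
corr_binary = np.corrcoef(i_a, i_b)[0, 1]

print(corr_binary)            # ≈ 0.19
print(2 * p_b_given_a - 1)    # the same value: Corr(I_A, I_B) = 2 P(B|A) - 1
print(0.5 + corr_binary / 2)  # ≈ 0.6, P(B|A) recovered from the correlation
```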

Answering the initial question

Going back to our initial example (and keeping the assumptions we’ve made along the way), if after binarizing we got a correlation of, let’s say, 0.2, that would mean that the probability of $Y$ being positive when $X$ is positive is 0.5 + 0.2/2 = 0.6.

Quick comment on the continuous approach

It’s possible to derive the conditional probability directly from a continuous bivariate distribution, without recomputing the correlation for the indicator variables. That path is, however, more analytically involved, so it will be left for another time. To get an idea of how the results may differ, in the most common case of the bivariate normal distribution we would have:

$$P(B|A) = 0.5 + \frac{\arcsin \text{Corr}(X, Y)}{\pi}.$$

Identifying terms with our previous expression we can see that:

$$\text{Corr}(I_A, I_B) = \frac{2}{\pi} \arcsin \text{Corr}(X, Y).$$
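
A quick simulation (bivariate normal data at the original correlation of 0.3) illustrates both relations:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000
rho = 0.3

# Bivariate normal sample with correlation rho (zero means, unit variances).
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

i_a, i_b = (x > 0).astype(int), (y > 0).astype(int)

print(np.corrcoef(i_a, i_b)[0, 1])  # ≈ 0.194, the binary correlation
print(2 / np.pi * np.arcsin(rho))   # ≈ 0.194, i.e. (2/π) arcsin Corr(X, Y)
print(i_b[i_a == 1].mean())         # ≈ 0.597, i.e. P(B|A) = 0.5 + arcsin(rho)/π
```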

As a side note, the value $\text{Corr}(I_A, I_B) = 0.2$ we used in the previous section comes precisely from applying this equation to our original correlation of 0.3. Since $\arcsin 1 = \frac{\pi}{2}$, $\text{Corr}(X, Y) = 1 \Rightarrow \text{Corr}(I_A, I_B) = 1$, which seems natural. Equality holds at 0 too, and for the values in between the binary correlation is slightly smaller in magnitude (absolute value), reaching a maximum deviation of less than 0.25 at around 0.75, as can be seen in the plot below.

[Figure: $\text{Corr}(I_A, I_B) = \frac{2}{\pi}\arcsin \text{Corr}(X, Y)$ plotted against $\text{Corr}(X, Y)$.]