Correlation Coefficients

Correlations and Pearson Coefficient:

Here, our objective is to generate two sets of N numbers which have prescribed Mean and Standard Deviation and whose Pearson Correlation coefficient is also prescribed.

First we generate two collections of N numbers:
a₁, a₂, ... (that is, a_k for k = 1, 2, ... N), with Mean = 0 and Standard Deviation = 1
and b₁, b₂, ... which (like the a_k) have Mean = 0 and SD = 1.
(1) (1/N)Σa_k = 0 and (1/N)Σa_k² = 1
(2) (1/N)Σb_k = 0 and (1/N)Σb_k² = 1

If we're satisfied with normally distributed numbers, we can do this with Excel, using NORMINV(RAND(),0,1), although a finite random sample will satisfy (1) and (2) only approximately.
Further, we assume that these sets of numbers have Correlation = 0
... which would be (approximately) the case if they were generated independently using NORMINV().
Then:
(3) (1/N)Σa_k b_k = 0.
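Randomly drawn numbers satisfy (1), (2) and (3) only approximately. The following is a minimal Python sketch of one way to make them hold up to rounding error: standardize each set, then remove from b its component along a (a single Gram-Schmidt step). Here random.gauss stands in for NORMINV(RAND(),0,1), and the function names standardize and make_uncorrelated_pair are illustrative, not part of the original method.

```python
import math
import random

def standardize(values):
    """Shift and scale so the mean is 0 and the (population) SD is 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def make_uncorrelated_pair(n, seed=0):
    """Return lists a, b satisfying (1), (2) and (3) up to rounding error."""
    rng = random.Random(seed)
    a = standardize([rng.gauss(0.0, 1.0) for _ in range(n)])
    b = standardize([rng.gauss(0.0, 1.0) for _ in range(n)])
    # (1/N) * sum(a_k * b_k) is the sample correlation of a and b.
    dot = sum(ak * bk for ak, bk in zip(a, b)) / n
    # Subtract the part of b that lies along a, then rescale to SD = 1.
    b = standardize([bk - dot * ak for ak, bk in zip(a, b)])
    return a, b

a, b = make_uncorrelated_pair(1000)
```

With a and b built this way, the later results hold exactly rather than only on average.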

From the above numbers we construct another collection, collection 1, namely
x₁ = M₁ + S₁a₁, x₂ = M₁ + S₁a₂, ... (which we represent by x_k = M₁ + S₁a_k for k = 1, 2, ... N).

This collection will have mean = M₁ and SD = S₁, since
(1/N)Σx_k = (1/N)Σ M₁ + S₁ (1/N)Σa_k = M₁ + S₁*0 = M₁ and
(1/N)Σ(x_k - M₁)² = S₁² (1/N)Σa_k² = S₁²*1 = S₁²

Similarly, we construct the collection 2, based upon the b_k, namely
y_k = M₂ + S₂b_k for k = 1, 2, ... N.
This collection will have mean = M₂ and SD = S₂.

So far we've been able to construct two sets of numbers (x and y) with prescribed Means and Standard Deviations, starting with two sets (a and b) with Mean = 0 and SD = 1. Now we work on the Pearson Correlation r:
r(x,y) = { (1/N)Σ (x_k - M₁) (y_k - M₂) } / { S₁ S₂ } where the numerator is the Covariance between the x and y sets.

Hence:
r(x,y) = { (1/N)Σ S₁ a_k S₂b_k } / { S₁ S₂ } = (1/N)Σ a_k b_k = r(a,b)
so the correlation for the sets x and y is the same as for the sets a and b.
However, we're assuming that r(a,b) = 0, so r(x,y) = 0
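The claim r(x,y) = r(a,b) can be checked numerically. The sketch below uses raw random draws (so r(a,b) is small but not exactly zero) and shows that shifting by M and scaling by a positive S leaves the Pearson coefficient unchanged. The values of M₁, S₁, M₂, S₂ and the helper name pearson are illustrative assumptions.

```python
import math
import random

def pearson(p, q):
    """Pearson correlation of two equal-length lists (1/N convention)."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((pi - mp) * (qi - mq) for pi, qi in zip(p, q)) / n
    sp = math.sqrt(sum((pi - mp) ** 2 for pi in p) / n)
    sq = math.sqrt(sum((qi - mq) ** 2 for qi in q) / n)
    return cov / (sp * sq)

rng = random.Random(1)
N = 500
a = [rng.gauss(0.0, 1.0) for _ in range(N)]
b = [rng.gauss(0.0, 1.0) for _ in range(N)]

M1, S1 = 10.0, 2.0   # illustrative Mean and SD for the x-set
M2, S2 = -3.0, 0.5   # illustrative Mean and SD for the y-set

x = [M1 + S1 * ak for ak in a]
y = [M2 + S2 * bk for bk in b]

r_ab = pearson(a, b)
r_xy = pearson(x, y)
print(r_ab, r_xy)  # the two values agree up to rounding error
```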

Now we generate a u-set, like so: u_k = M₁ + S₁ (A a_k + B b_k) where A and B are as-yet-unknown constants.
Note that the Mean of this u-set is:
(1/N)Σu_k = M₁ + S₁(A/N)Σa_k + S₁(B/N)Σb_k = M₁ + 0 = M₁, the same Mean as the x-set.

Further, the Standard Deviation of this u-set is determined from:
SD² = (1/N)Σ(u_k - M₁)² = S₁²(1/N)Σ (A a_k + B b_k)² = S₁²(1/N) { A² Σa_k² + B²Σb_k² + 2ABΣa_kb_k }

So, using (1), (2) and (3) we get:
(4) SD² = S₁² { A²*1 + B²*1 + 2AB*0 } = S₁² (A² + B²)
and, choosing A² + B² = 1, we get (finally):
SD² = S₁² so the u-set has the same Standard Deviation as the x-set, namely S₁.

So far we've managed to modify the x-set, creating a u-set, yet maintaining the Mean and SD.

Now, we calculate the correlation:

r(u,y) = { (1/N)Σ (u_k - M₁) (y_k - M₂) } / { S₁ S₂ } = (1/N)Σ (A a_k + B b_k) b_k = A*0 + B*1 = B using (3) and (2).

Finally, then, we can start with uncorrelated sets x and y, specify a correlation B (with -1 ≤ B ≤ 1, so that A = SQRT(1-B²) is real), and construct a u-set with the same Mean and SD as the x-set, namely M₁ and S₁ ... but with the specified correlation with the y-set, via u_k = M₁ + S₁ { SQRT(1-B²)a_k + B b_k }
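As a rough illustration of this recipe, the Python sketch below builds a and b so that (1), (2) and (3) hold up to rounding error, forms u_k and y_k, and then measures the Mean, SD and correlation. The numerical values and the helper names (standardize, pearson, correlated_with) are assumptions chosen for the example.

```python
import math
import random

def standardize(values):
    """Shift and scale so the mean is 0 and the (population) SD is 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson(p, q):
    """Pearson correlation: mean product of the standardized values."""
    ps, qs = standardize(p), standardize(q)
    return sum(pi * qi for pi, qi in zip(ps, qs)) / len(p)

def correlated_with(a, b, M1, S1, B):
    """u_k = M1 + S1 * (sqrt(1 - B^2) * a_k + B * b_k)."""
    if not -1.0 <= B <= 1.0:
        raise ValueError("B must lie in [-1, 1]")
    A = math.sqrt(1.0 - B * B)
    return [M1 + S1 * (A * ak + B * bk) for ak, bk in zip(a, b)]

rng = random.Random(2)
N = 1000
a = standardize([rng.gauss(0.0, 1.0) for _ in range(N)])
b = standardize([rng.gauss(0.0, 1.0) for _ in range(N)])
# Enforce (3): remove from a its component along b, then rescale.
dot = sum(ak * bk for ak, bk in zip(a, b)) / N
a = standardize([ak - dot * bk for ak, bk in zip(a, b)])

M1, S1, M2, S2, B = 5.0, 3.0, 1.0, 4.0, 0.6   # illustrative values
y = [M2 + S2 * bk for bk in b]
u = correlated_with(a, b, M1, S1, B)

mean_u = sum(u) / N
sd_u = math.sqrt(sum((uk - mean_u) ** 2 for uk in u) / N)
r_uy = pearson(u, y)
print(mean_u, sd_u, r_uy)  # close to M1, S1 and B respectively
```

If a and b are plain random draws instead, the measured values will be near M₁, S₁ and B but not equal to them.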

Now we consider THREE sets: a_k, b_k and c_k (each with Mean = 0 and SD = 1) and three derived sets: x_k, y_k and z_k = M₃ + S₃c_k ... all of which are mutually uncorrelated.

We generate a u-set (as we did above) according to:
u_k = M₁ + S₁ { A a_k + B b_k + C c_k } with Mean M₁ and Standard Deviation^* S₁ the same as the x-set

^* Note:
For the u-set, SD² = S₁² (A² + B² + C²), so SD = S₁ provided A² + B² + C² = 1

Now, for the correlations:

r(u,y) = { (1/N)Σ (u_k - M₁) (y_k - M₂) } / { S₁ S₂ } = (1/N)Σ (A a_k + B b_k + C c_k) b_k = A 0 + B 1 + C 0 = B

and
r(u,z) = { (1/N)Σ (u_k - M₁) (z_k - M₃) } / { S₁ S₃ } = (1/N)Σ (A a_k + B b_k + C c_k) c_k = A 0 + B 0 + C 1 = C

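We now have sets u, y and z with prescribed Means and Standard Deviations, namely (M₁,S₁), (M₂,S₂) and (M₃,S₃), and correlations r(u,y) = B, r(u,z) = C, where A = SQRT(1 - B² - C²), which requires B² + C² ≤ 1. The y and z sets remain uncorrelated with each other.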
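The three-set construction can be sketched in Python along the same lines. To make the base sets exactly uncorrelated, this example applies Gram-Schmidt to three centered random vectors; that step, the numerical values, and the helper names (center, unit_sd, orthonormal_sets, corr) are assumptions for illustration only.

```python
import math
import random

def center(v):
    """Subtract the mean."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def unit_sd(v):
    """Scale an already-centered list so that (1/N) * sum(v_k^2) = 1."""
    sd = math.sqrt(sum(vi * vi for vi in v) / len(v))
    return [vi / sd for vi in v]

def orthonormal_sets(n, count, seed=0):
    """Return `count` lists with mean 0, SD 1 and zero pairwise correlation."""
    rng = random.Random(seed)
    basis = []
    for _ in range(count):
        v = center([rng.gauss(0.0, 1.0) for _ in range(n)])
        for w in basis:  # Gram-Schmidt: strip the component along each earlier set
            proj = sum(vi * wi for vi, wi in zip(v, w)) / n
            v = [vi - proj * wi for vi, wi in zip(v, w)]
        basis.append(unit_sd(v))
    return basis

def corr(p, q):
    """Pearson correlation (1/N convention)."""
    p, q = unit_sd(center(p)), unit_sd(center(q))
    return sum(pi * qi for pi, qi in zip(p, q)) / len(p)

N = 2000
a, b, c = orthonormal_sets(N, 3, seed=3)

M1, S1, M2, S2, M3, S3 = 0.08, 0.20, 0.05, 0.10, 0.03, 0.05  # illustrative
B, C = 0.5, -0.3
A = math.sqrt(1.0 - B * B - C * C)  # real only if B^2 + C^2 <= 1

u = [M1 + S1 * (A * ak + B * bk + C * ck) for ak, bk, ck in zip(a, b, c)]
y = [M2 + S2 * bk for bk in b]
z = [M3 + S3 * ck for ck in c]

r_uy, r_uz, r_yz = corr(u, y), corr(u, z), corr(y, z)
print(r_uy, r_uz, r_yz)  # close to B, C and 0 respectively
```

Note that only the correlations involving u are prescribed here; y and z stay uncorrelated with each other.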