After completing this tutorial, you will know:
- Variables within a dataset can be related for many reasons.
- A correlation can be positive, meaning both variables move in the same direction, or negative, meaning that when one variable's value increases, the other's decreases. A correlation can also be neutral or zero, meaning that the variables are unrelated.
- The performance of some algorithms can deteriorate if two or more variables are tightly related, a condition called multicollinearity.
Before we look at correlation methods, let's define a dataset we can use to test the methods. We will generate 1,000 samples of two variables with a strong positive correlation. The first variable will be random numbers drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 20. The second variable will be values from the first variable with Gaussian noise added with a mean of 50 and a standard deviation of 10. We will use the randn() function to generate random Gaussian values with a mean of 0 and a standard deviation of 1, then multiply the results by our own standard deviation and add the mean to shift the values into the preferred range. The pseudorandom number generator is seeded to ensure that we get the same sample of numbers each time the code is run.
# generate related variables
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()
Running the example first prints the mean and standard deviation for each variable.
A scatter plot of the two variables is created. Because we contrived the dataset, we know there is a relationship between the two variables. This is clear when we review the generated scatter plot where we can see an increasing trend.

Variables can be related by a linear relationship. This is a relationship that is consistently additive across the two data samples. This relationship can be summarized by the covariance between the two variables. It is calculated as the average of the product between the values from each sample, where the values have been centered (had their mean subtracted). The calculation of the sample covariance is as follows:

cov(X, Y) = sum((x - mean(X)) * (y - mean(Y))) / (n - 1)
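As a quick illustration of this formula (a minimal sketch using the same data-generation scheme as above, separate from the worked examples that follow), the sample covariance can be computed directly:

# manually calculate the sample covariance from the definition
from numpy import mean
from numpy.random import randn
from numpy.random import seed
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# center each sample by subtracting its mean
centered1 = data1 - mean(data1)
centered2 = data2 - mean(data2)
# average the products of the centered values, dividing by n - 1
n = len(data1)
covariance = sum(centered1 * centered2) / (n - 1)
print('covariance: %.3f' % covariance)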
The use of the mean in the calculation suggests the need for each data sample to have a Gaussian or Gaussian-like distribution. The sign of the covariance can be interpreted as whether the two variables change in the same direction (positive) or change in different directions (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that the two variables have no linear relationship, although they may still be related in a nonlinear way. The cov() NumPy function can be used to calculate a covariance matrix between two or more variables.
...
# calculate the covariance between two samples
covariance = cov(data1, data2)
The diagonal of the matrix contains the covariance of each variable with itself, which is its variance. The other values in the matrix represent the covariance between the two variables; in this case, the remaining two values are the same given that we are calculating the covariance for only two variables. We can calculate the covariance matrix for the two variables in our test problem. The complete example is listed below.
# calculate the covariance between two variables
from numpy.random import randn
from numpy.random import seed
from numpy import cov
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate covariance matrix
covariance = cov(data1, data2)
print(covariance)
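Because cov() returns a 2x2 matrix for two variables, it can help to pull out the single covariance value; a short follow-on snippet (continuing from the example above):

...
# the off-diagonal entry is the covariance between data1 and data2
print('covariance of data1 and data2: %.3f' % covariance[0, 1])
# the diagonal entries are the variance of each variable
print('variance of data1: %.3f' % covariance[0, 0])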
A problem with covariance as a standalone statistical tool is that its magnitude is challenging to interpret. This leads us to the Pearson's correlation coefficient next.
The Pearson’s correlation coefficient (named for Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples. The Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviation of each data sample. It is the normalization of the covariance between the two variables to give an interpretable score.
Pearson's correlation coefficient = cov(X, Y) / (stdv(X) * stdv(Y))
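As an illustrative sketch (using the same test dataset as above), the coefficient can be computed directly from this definition; note that std() is given ddof=1 so that it matches the sample covariance returned by cov():

# manually calculate Pearson's correlation from its definition
from numpy import cov
from numpy import std
from numpy.random import randn
from numpy.random import seed
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# covariance divided by the product of the standard deviations
corr = cov(data1, data2)[0, 1] / (std(data1, ddof=1) * std(data2, ddof=1))
print('Pearsons correlation: %.3f' % corr)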
The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution. The result of the calculation, the correlation coefficient, can be interpreted to understand the relationship. The coefficient returns a value between -1 and 1 that represents the limits of correlation from a full negative correlation to a full positive correlation. A value of 0 means no correlation. The value must be interpreted, where often a value below -0.5 or above 0.5 indicates a notable correlation, and values in between suggest a less notable correlation.
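To make this rule of thumb concrete, here is a small illustrative helper (the function name and cutoffs are our own, mirroring the thresholds stated in this tutorial rather than any universal standard):

# illustrative helper: map a coefficient to the rule of thumb above
def interpret_correlation(r):
    # cutoffs follow this tutorial's rule of thumb, not a universal standard
    if abs(r) >= 0.5:
        direction = 'positive' if r > 0 else 'negative'
        return 'notable %s correlation' % direction
    if r == 0:
        return 'no correlation'
    return 'less notable correlation'

# example: the coefficient computed later in this tutorial
print(interpret_correlation(0.888))  # notable positive correlation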
The calculation of Pearson's correlation also involves a statistical hypothesis test whose null hypothesis is that there is no relationship between the samples. The p-value can be interpreted as follows: if the p-value is at or below a chosen significance level (alpha), the null hypothesis is rejected and we conclude there is likely some relationship; if it is above alpha, we fail to reject the null hypothesis.
The pearsonr() SciPy function can be used to calculate the Pearson’s correlation coefficient between two data samples with the same length. We can calculate the correlation between the two variables in our test problem. The complete example is listed below.
# calculate Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr, p = pearsonr(data1, data2)
# display the correlation
print('Pearsons correlation: %.3f' % corr)
# interpret the significance
alpha = 0.05
if p > alpha:
    print('No correlation (fail to reject H0)')
else:
    print('Some correlation (reject H0)')
Running the example calculates and prints the Pearson’s correlation coefficient and interprets the p-value. We can see that the two variables are positively correlated and that the correlation is 0.888. This suggests a high level of correlation (as we expected).
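As a quick cross-check (an extra step beyond the example above), NumPy's corrcoef() function computes the same coefficient without the hypothesis test:

...
# cross-check: corrcoef() returns a correlation matrix; take the off-diagonal
from numpy import corrcoef
print('corrcoef: %.3f' % corrcoef(data1, data2)[0, 1])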
