• About Covariance and Correlation


    After completing this tutorial, you will know:

    • How to calculate a covariance matrix to summarize the linear relationship between two or more variables.
    • How to calculate the covariance to summarize the linear relationship between two variables.
    • How to calculate the Pearson’s correlation coefficient to summarize the linear relationship between two variables.

    1.1 Tutorial Overview

    • What is Correlation?
    • Test Dataset
    • Covariance
    • Pearson's Correlation

    1.2 What is Correlation?

    Variables within a dataset can be related for lots of reasons.

    • One variable could cause or depend on the values of another variable.
    • One variable could be lightly associated with another variable.
    • Two variables could depend on a third unknown variable.

    A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variable’s value decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated.

    • Positive Correlation: Both variables change in the same direction.
    • Neutral Correlation: No relationship in the change of the variables.
    • Negative Correlation: Variables change in opposite directions.
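
    The three cases above can be illustrated with synthetic data. The sketch below is only an illustration: the variable names and noise levels are assumptions chosen for this example, and it uses NumPy's corrcoef(), which returns the Pearson's correlation coefficient covered later in this tutorial.

```python
# Illustrative sketch: names and noise levels are assumptions for this example.
from numpy import corrcoef
from numpy.random import randn, seed

seed(1)
x = 20 * randn(1000) + 100

y_positive = x + 10 * randn(1000)    # built from x, so it moves with x
y_neutral = 20 * randn(1000) + 100   # generated independently of x
y_negative = -x + 10 * randn(1000)   # built from -x, so it moves against x

# corrcoef() returns a 2x2 matrix; entry [0, 1] is the coefficient for the pair
r_positive = corrcoef(x, y_positive)[0, 1]
r_neutral = corrcoef(x, y_neutral)[0, 1]
r_negative = corrcoef(x, y_negative)[0, 1]

print('positive: %.3f' % r_positive)
print('neutral:  %.3f' % r_neutral)
print('negative: %.3f' % r_negative)
```

    The positive and negative cases should give coefficients well away from zero with opposite signs, while the neutral case should land near zero.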

    The performance of some algorithms can deteriorate if two or more input variables are tightly related, a condition called multicollinearity.
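
    One rough way to spot tightly related variables is to scan the pairwise correlation matrix for large absolute values. The following is a minimal sketch under stated assumptions: the three features are synthetic, and the 0.9 threshold is an arbitrary cut-off picked for illustration, not a standard rule.

```python
# Illustrative sketch: synthetic features and a hypothetical 0.9 threshold.
from numpy import column_stack, corrcoef
from numpy.random import randn, seed

seed(1)
a = randn(1000)
b = a + 0.1 * randn(1000)   # nearly a copy of a
c = randn(1000)             # generated independently of a and b
features = column_stack([a, b, c])

# rowvar=False tells corrcoef() that each column, not each row, is a variable
corr = corrcoef(features, rowvar=False)

threshold = 0.9  # assumption: arbitrary cut-off for "tightly related"
n_features = corr.shape[0]
flagged = [(i, j)
           for i in range(n_features)
           for j in range(i + 1, n_features)
           if abs(corr[i, j]) > threshold]

print('highly correlated column pairs:', flagged)
```

    Only the pair made of a and its noisy copy b should be flagged here.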

    1.3 Test Dataset

    Before we look at correlation methods, let’s define a dataset we can use to test the methods. We will generate 1,000 samples of two variables with a strong positive correlation. The first variable will be random numbers drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 20. The second variable will be the values from the first variable plus Gaussian noise with a mean of 50 and a standard deviation of 10. We will use the randn() function to generate random Gaussian values with a mean of 0 and a standard deviation of 1, then multiply the results by our own standard deviation and add the mean to shift the values into the preferred range. The pseudorandom number generator is seeded to ensure that we get the same sample of numbers each time the code is run.

    # generate related variables
    from numpy import mean
    from numpy import std
    from numpy.random import randn
    from numpy.random import seed
    from matplotlib import pyplot
    # seed random number generator
    seed(1)
    # prepare data
    data1 = 20 * randn(1000) + 100
    data2 = data1 + (10 * randn(1000) + 50)
    # summarize
    print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
    print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
    # plot
    pyplot.scatter(data1, data2)
    pyplot.show()

    Running the example first prints the mean and standard deviation for each variable.

    A scatter plot of the two variables is created. Because we contrived the dataset, we know there is a relationship between the two variables. This is clear when we review the generated scatter plot where we can see an increasing trend.
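
    Because the second variable is the first variable plus independent noise, the values the summary should come close to can be worked out by hand. In the short derivation below, X stands for data1, E for the added noise, and Y = X + E for data2; these symbols are introduced here for illustration only.

```latex
\begin{aligned}
\mathbb{E}[Y] &= \mathbb{E}[X] + \mathbb{E}[E] = 100 + 50 = 150 \\
\sigma_Y &= \sqrt{\sigma_X^2 + \sigma_E^2} = \sqrt{20^2 + 10^2} = \sqrt{500} \approx 22.36
\end{aligned}
```

    The printed sample statistics should be near these values rather than exactly equal to them, since they are estimated from 1,000 random draws.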

    1.4 Covariance

    Variables can be related by a linear relationship. This is a relationship that is consistently additive across the two data samples. The relationship between two variables can be summarized by a statistic called the covariance. It is calculated as the average of the product between the values from each sample, where the values have been centered (had their mean subtracted); for a sample, the divisor is n - 1 rather than n. The calculation of the sample covariance is as follows:

    $$\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

    The use of the mean in the calculation suggests the need for each data sample to have a Gaussian or Gaussian-like distribution. The sign of the covariance can be interpreted as whether the two variables change in the same direction (positive) or change in different directions (negative). The magnitude of the covariance is not easily interpreted, because it depends on the units of the variables. A covariance of zero indicates that there is no linear relationship between the variables; it does not by itself mean that they are independent. The cov() NumPy function can be used to calculate a covariance matrix between two or more variables.

    ...
    # calculate the covariance between two samples
    covariance = cov(data1, data2)

    The diagonal of the matrix contains the covariance between each variable and itself (that is, its variance). The other values in the matrix represent the covariance between the two variables; in this case, the remaining two values are the same given that we are calculating the covariance for only two variables. We can calculate the covariance matrix for the two variables in our test problem. The complete example is listed below.

    # calculate the covariance between two variables
    from numpy.random import randn
    from numpy.random import seed
    from numpy import cov
    # seed random number generator
    seed(1)
    # prepare data
    data1 = 20 * randn(1000) + 100
    data2 = data1 + (10 * randn(1000) + 50)
    # calculate covariance matrix
    covariance = cov(data1, data2)
    print(covariance)
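
    To connect the matrix to the definition, the sample covariance can also be computed by hand: center both samples, sum the products, and divide by n - 1. The sketch below does this on the same test data and compares the result with cov(). The helper name sample_covariance is hypothetical and introduced only for this illustration.

```python
# Illustrative sketch: sample_covariance is a hypothetical helper, not a NumPy function.
from numpy import allclose, cov, diag, var
from numpy.random import randn, seed

seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

def sample_covariance(x, y):
    """Sum of products of the centered values, divided by n - 1."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / (len(x) - 1)

manual = sample_covariance(data1, data2)
matrix = cov(data1, data2)

print('manual covariance: %.3f' % manual)
print('cov()[0, 1]:       %.3f' % matrix[0, 1])

# the diagonal holds the sample variances (ddof=1 matches cov()'s n - 1 divisor)
diag_matches = allclose(diag(matrix), [var(data1, ddof=1), var(data2, ddof=1)])
print('diagonal equals sample variances:', diag_matches)
```

    The two off-diagonal entries of the matrix hold the same number, which is the value the manual calculation reproduces.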

    A problem with covariance as a statistical tool alone is that it is challenging to interpret. This leads us to the Pearson’s correlation coefficient next.
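
    One reason the covariance is hard to interpret is that its size depends on the units of measurement. The sketch below rescales both variables by a factor of 100, as if converting metres to centimetres (the units are purely hypothetical; the test data has none), and compares how the covariance and the correlation respond.

```python
# Illustrative sketch: the metre/centimetre units are hypothetical.
from numpy import corrcoef, cov
from numpy.random import randn, seed

seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# same data expressed in a unit 100 times smaller
data1_cm = data1 * 100
data2_cm = data2 * 100

cov_m = cov(data1, data2)[0, 1]
cov_cm = cov(data1_cm, data2_cm)[0, 1]
r_m = corrcoef(data1, data2)[0, 1]
r_cm = corrcoef(data1_cm, data2_cm)[0, 1]

print('covariance:  %.3f vs %.3f' % (cov_m, cov_cm))
print('correlation: %.3f vs %.3f' % (r_m, r_cm))
```

    The covariance grows by a factor of 100 * 100 = 10,000 while the correlation is unchanged, which is why a normalized score is easier to read.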

    1.5 Pearson's Correlation

    The Pearson’s correlation coefficient (named for Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples. The Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviation of each data sample. It is the normalization of the covariance between the two variables to give an interpretable score.
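
    Written out, the description above is r = cov(X, Y) / (stdv(X) * stdv(Y)). As a minimal sketch, the coefficient can be computed directly from that definition and compared with SciPy's pearsonr(). One assumption to note: the standard deviations use ddof=1 so that they share the n - 1 divisor that cov() uses by default.

```python
# Illustrative sketch: computes Pearson's r from its definition and checks it against SciPy.
from numpy import cov, std
from numpy.random import randn, seed
from scipy.stats import pearsonr

seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

covariance = cov(data1, data2)[0, 1]
# ddof=1 keeps the divisor consistent with cov(), so the n - 1 terms cancel
manual_r = covariance / (std(data1, ddof=1) * std(data2, ddof=1))

scipy_r, _ = pearsonr(data1, data2)

print('manual:     %.3f' % manual_r)
print('pearsonr(): %.3f' % scipy_r)
```

    Both lines should print the same value, since pearsonr() implements the same normalization.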

    The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution. The result of the calculation, the correlation coefficient, can be interpreted to understand the relationship. The coefficient takes a value between -1 and 1, ranging from a full negative correlation to a full positive correlation. A value of 0 means no linear correlation. The value must be interpreted: often a value below -0.5 or above 0.5 indicates a notable correlation, and values between -0.5 and 0.5 suggest a less notable correlation. See the table below to help with interpreting the correlation coefficient.

    The Pearson’s correlation can also be used as a statistical hypothesis test, where the null hypothesis (H0) assumes that there is no relationship between the samples. The p-value can be interpreted as follows:

    • p-value ≤ alpha: significant result, reject null hypothesis, some relationship (H1).
    • p-value > alpha: not significant result, fail to reject null hypothesis, no relationship (H0).
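
    The decision rule in the two bullets above can be sketched as a small function. The name interpret_p_value and the default alpha of 0.05 are assumptions made for this illustration. Note the boundary case: a p-value exactly equal to alpha counts as significant under the rule as stated.

```python
# Illustrative sketch: interpret_p_value is a hypothetical helper for the rule above.
def interpret_p_value(p, alpha=0.05):
    """Return the decision for a p-value at significance level alpha."""
    if p <= alpha:
        return 'Some correlation (reject H0)'
    return 'No correlation (fail to reject H0)'

for p in (0.001, 0.05, 0.2):
    print('p=%.3f -> %s' % (p, interpret_p_value(p)))
```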

    The pearsonr() SciPy function can be used to calculate the Pearson’s correlation coefficient between two data samples with the same length. We can calculate the correlation between the two variables in our test problem. The complete example is listed below.

    # calculate the pearson's correlation between two variables
    from numpy.random import randn
    from numpy.random import seed
    from scipy.stats import pearsonr
    # seed random number generator
    seed(1)
    # prepare data
    data1 = 20 * randn(1000) + 100
    data2 = data1 + (10 * randn(1000) + 50)
    # calculate Pearson's correlation
    corr, p = pearsonr(data1, data2)
    # display the correlation
    print('Pearsons correlation: %.3f' % corr)
    # interpret the significance
    alpha = 0.05
    if p > alpha:
        print('No correlation (fail to reject H0)')
    else:
        print('Some correlation (reject H0)')

    Running the example calculates and prints the Pearson’s correlation coefficient and interprets the p-value. We can see that the two variables are positively correlated and that the correlation is 0.888. This suggests a high level of correlation (as we expected).
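
    The reported value can be sanity-checked against the way the dataset was constructed. In the derivation below, X stands for data1, E for the independent noise, and Y = X + E for data2; these symbols are introduced here for illustration only.

```latex
\begin{aligned}
\operatorname{Cov}(X, Y) &= \operatorname{Cov}(X, X + E) = \operatorname{Var}(X) + \operatorname{Cov}(X, E) = 20^2 + 0 = 400 \\
\sigma_Y &= \sqrt{\sigma_X^2 + \sigma_E^2} = \sqrt{20^2 + 10^2} = \sqrt{500} \\
\rho_{X,Y} &= \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{400}{20 \sqrt{500}} = \frac{20}{\sqrt{500}} \approx 0.894
\end{aligned}
```

    The sample estimate of 0.888 is close to this theoretical value of about 0.894; the small gap is what we would expect from estimating the coefficient on a finite random sample.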

  • Original article: https://blog.csdn.net/u011868279/article/details/125501466