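While working through a feature-selection exercise recently, I had to assess the correlation among many features whose relationships are non-linear. The familiar Pearson correlation coefficient is not reliable in that setting, so I turned to the distance correlation coefficient instead.
Distance correlation measures the independence of two variables: a distance correlation of 0 means the two variables are independent. This overcomes a weakness of the Pearson coefficient, where a value of 0 does not necessarily mean the variables are independent, because they may still be non-linearly related.
We use distance correlation to study the independence of two variables $\mu$ and $v$, written $\mathrm{dcorr}(\mu, v)$.
When $\mathrm{dcorr}(\mu, v)=0$, the variables $\mu$ and $v$ are independent; the larger $\mathrm{dcorr}(\mu, v)$ is, the stronger the dependence between $\mu$ and $v$.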
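To make the Pearson weakness concrete, here is a minimal sketch (my own illustration, not part of the original exercise) in which `y` is completely determined by `x`, yet the Pearson coefficient is numerically zero. The grid and the quadratic relationship are arbitrary choices made for the demonstration.

```python
import numpy as np

# x is a symmetric grid around 0; y depends on x deterministically (y = x^2).
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation only captures the linear part of the relationship,
# which vanishes here because the relationship is symmetric around x = 0.
pearson_r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {pearson_r:.3e}")
```

The printed value is zero up to floating-point error, even though knowing `x` tells you `y` exactly.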
Let $\{(\mu_i, v_i),\ i=1,2,\cdots,n\}$ be a random sample from the population $(\mu, v)$. Székely et al. define the sample estimate of the distance correlation (DC) between two random variables $\mu$ and $v$ as:
$$\mathrm{dcorr}(\mu, v)=\frac{\mathrm{dcov}(\mu, v)}{\sqrt{\mathrm{dcov}(\mu, \mu)\,\mathrm{dcov}(v, v)}}$$
where $\mathrm{dcov}^2(\mu, v)=\hat{S}_1+\hat{S}_2-2\hat{S}_3$, and $\hat{S}_1$, $\hat{S}_2$, $\hat{S}_3$ are given below. Here $\|\cdot\|_{d_\mu}$ and $\|\cdot\|_{d_v}$ denote the Euclidean norms in the spaces of $\mu$ and $v$ respectively.
$$\hat{S}_1=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\|\mu_i-\mu_j\|_{d_\mu}\,\|v_i-v_j\|_{d_v}$$
$$\hat{S}_2=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\|\mu_i-\mu_j\|_{d_\mu}\cdot\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\|v_i-v_j\|_{d_v}$$
$$\hat{S}_3=\frac{1}{n^3}\sum_{i=1}^n\sum_{j=1}^n\sum_{l=1}^n\|\mu_i-\mu_l\|_{d_\mu}\,\|v_j-v_l\|_{d_v}$$
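The three sums translate almost line by line into NumPy. The following is a sketch under my own naming (`dcov2_from_sums` and `dcorr_from_sums` are hypothetical helper names, not from any library); it builds the full pairwise distance matrices, so it is only meant for small samples.

```python
import numpy as np


def dcov2_from_sums(u, v):
    """Squared sample distance covariance computed as S1 + S2 - 2*S3."""
    # Shape the inputs as (n, d) arrays so scalar and vector samples both work.
    u = np.asarray(u, dtype=float).reshape(len(u), -1)
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    n = u.shape[0]
    # Pairwise Euclidean distances ||u_i - u_j|| and ||v_i - v_j||.
    a = np.linalg.norm(u[:, None, :] - u[None, :, :], axis=-1)
    b = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    s1 = (a * b).sum() / n**2
    s2 = (a.sum() / n**2) * (b.sum() / n**2)
    # For each l, multiply sum_i ||u_i - u_l|| by sum_j ||v_j - v_l||.
    s3 = (a.sum(axis=0) * b.sum(axis=0)).sum() / n**3
    return s1 + s2 - 2.0 * s3


def dcorr_from_sums(u, v):
    """dcorr = dcov(u, v) / sqrt(dcov(u, u) * dcov(v, v)), with dcov = sqrt(dcov^2)."""
    denom = np.sqrt(dcov2_from_sums(u, u) * dcov2_from_sums(v, v))
    if denom == 0.0:
        return 0.0  # a constant sample has no distance variance
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(dcov2_from_sums(u, v), 0.0) / denom))


a = [1, 2, 3, 4, 5]
b = [1, 2, 9, 4, 4]
print(dcorr_from_sums(a, b))
```

The triple sum in $\hat{S}_3$ factorises over the index $l$, which is why the code needs only column sums rather than three nested loops.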
from scipy.spatial.distance import pdist, squareform
import numpy as np


def distcorr(X, Y):
    """ Compute the distance correlation function

    >>> a = [1,2,3,4,5]
    >>> b = np.array([1,2,9,4,4])
    >>> distcorr(a, b)
    0.762676242417
    """
    X = np.atleast_1d(X)
    Y = np.atleast_1d(Y)
    # Treat 1-D input as n samples of a single variable (a column vector).
    if np.prod(X.shape) == len(X):
        X = X[:, None]
    if np.prod(Y.shape) == len(Y):
        Y = Y[:, None]
    X = np.atleast_2d(X)
    Y = np.atleast_2d(Y)
    n = X.shape[0]
    if Y.shape[0] != X.shape[0]:
        raise ValueError('Number of samples must match')
    # Pairwise Euclidean distance matrices, each of shape (n, n).
    a = squareform(pdist(X))
    b = squareform(pdist(Y))
    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0)[None, :] - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0)[None, :] - b.mean(axis=1)[:, None] + b.mean()
    # Squared sample distance covariance and distance variances.
    dcov2_xy = (A * B).sum()/float(n * n)
    dcov2_xx = (A * A).sum()/float(n * n)
    dcov2_yy = (B * B).sum()/float(n * n)
    # dcor = dcov(X, Y) / sqrt(dcov(X, X) * dcov(Y, Y))
    dcor = np.sqrt(dcov2_xy)/np.sqrt(np.sqrt(dcov2_xx) * np.sqrt(dcov2_yy))
    return dcor
This code was written by a developer abroad; the original is available at:
https://gist.github.com/satra/aa3d19a12b74e9ab7941
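Coming back to the feature-selection motivation, one possible way to use such a function is to score every feature against the target and sort by the score. The sketch below is my own illustration on synthetic data: `distance_correlation` is a compact NumPy-only re-implementation of the double-centering approach so that the block runs on its own, and `rank_features` is a hypothetical helper name. The target depends non-linearly on feature 0 only, which is an assumption built into the toy data.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation via double-centered distance matrices."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0.0:
        return 0.0  # at least one variable is constant
    return float(np.sqrt(max((A * B).mean(), 0.0) / denom))


def rank_features(X, y):
    """Return (feature index, score) pairs sorted by descending distance correlation."""
    scores = [distance_correlation(X[:, j], y) for j in range(X.shape[1])]
    order = np.argsort(scores)[::-1]
    return [(int(j), scores[j]) for j in order]


# Synthetic data: y is a noisy quadratic function of feature 0 only.
rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-1.0, 1.0, size=(n, 3))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=n)

ranking = rank_features(X, y)
for index, score in ranking:
    print(f"feature {index}: dcorr = {score:.3f}")
```

On finite samples the score of an irrelevant feature is small but not exactly zero, so the ranking is more informative than any fixed threshold.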
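The tricky part is installing the conda version of the dcor package. Running pip install dcor kept failing for me with an error I could not resolve, and it was not one of the error types covered in the blog posts I found.
After some digging, I settled on the following installation method: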
conda install -c conda-forge dcor
With this, the installation completes successfully.
>>> import numpy as np
>>> import dcor
>>> a = np.array([1, 2, 3, 4, 5], dtype=float)
>>> b = np.array([1, 2, 9, 4, 4], dtype=float)
>>> dcor.distance_correlation(a, b)
0.762676242417
For further details, see the official documentation:
https://dcor.readthedocs.io/en/latest/apilist.html