• MachineLearning 23. Ridge Regression for Genotype and Phenotype Prediction (Ridge)




    Introduction

    Ridge regression (also known as Tikhonov regularization) is a biased-estimation regression method designed for the analysis of collinear data. It is essentially a modified form of least squares: by giving up the unbiasedness of ordinary least squares, it trades some information and precision for regression coefficients that are more realistic and more reliable, and it fits ill-conditioned data better than ordinary least squares does.

    To date, many genetic variants have been identified as being associated with different phenotypic traits. However, the identified associations usually explain only a small fraction of trait heritability, and models containing only the known associated variants have little predictive power. Multiple regression is a popular framework that considers the joint effect of many genetic variants simultaneously, but because genetic data are high-dimensional and the predictors are correlated, ordinary multiple regression is rarely applicable to such data. This has led to renewed interest in penalized regression techniques that circumvent these difficulties. This article focuses on ridge regression, a penalized regression method that has been shown to perform well in multivariate prediction problems. One challenge in applying ridge regression is the choice of the ridge parameter, which controls the amount of shrinkage applied to the regression coefficients; a data-driven method for determining the ridge parameter has been proposed for this purpose.

    Goal: to obtain good performance in high-dimensional prediction problems. A theoretical justification for the method has been established, and its performance has been verified on simulated genetic data and on real-data examples. Fitting a ridge regression model to hundreds of thousands to millions of genetic variants simultaneously also poses computational challenges.
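    The bias-for-variance trade described above can be made precise with the standard ridge estimator (a textbook definition, not stated explicitly in the original article):

    \[
    \hat{\beta}_{\text{ridge}}
    = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
    = (X^{\top}X + \lambda I)^{-1} X^{\top} y
    \]

    Here \(\lambda \ge 0\) is the ridge parameter: \(\lambda = 0\) recovers ordinary least squares, while larger \(\lambda\) shrinks the coefficients toward zero, adding bias but reducing variance. Because \(X^{\top}X + \lambda I\) is invertible even when \(X^{\top}X\) is singular, the estimator remains well defined for collinear, high-dimensional genetic data; choosing \(\lambda\) well is the subject of this article.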


    Package Installation

    The ridge package is easy to install: a direct call to install.packages() is all that is needed.

    install.packages("ridge")
    devtools::install_github("selva86/InformationValue")

    Data Loading

    Input: the independent variables X are one or more quantitative variables or binary categorical variables; the dependent variable Y must be quantitative (for a categorical Y, use logistic regression instead). Output: goodness-of-fit test results, the linear relationship between the independent variables and the dependent variable, and so on.

    Continuous phenotype

    library(ridge)
    data(GenCont)
    head(GenCont[, 1:10])
    ## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9
    ## [1,] 1.4356316 1 1 1 0 0 1 0 0 0
    ## [2,] 2.9226960 1 0 0 0 1 0 0 1 0
    ## [3,] 0.5669319 0 0 0 0 0 0 0 0 0
    ## [4,] 4.8515051 1 0 0 0 1 0 0 0 0
    ## [5,] 0.1525582 1 1 1 0 0 1 0 0 0
    ## [6,] 2.9522701 1 0 0 0 1 0 0 0 0

    Binary phenotype

    data(GenBin)
    head(GenBin)
    ## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
    ## [1,] 0 1 0 0 0 0 0 0 1 0 1 0 0
    ## [2,] 0 0 0 0 0 0 0 1 0 0 0 0 0
    ## [3,] 1 0 0 0 0 0 0 2 0 1 0 0 1
    ## [4,] 0 0 0 0 0 0 1 0 0 0 0 0 0
    ## [5,] 0 0 0 0 0 0 0 1 0 0 0 0 0
    ## [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0
    ## SNP13 SNP14
    ## [1,] 0 1
    ## [2,] 1 0
    ## [3,] 1 0
    ## [4,] 0 0
    ## [5,] 0 0
    ## [6,] 0 0

    Worked Examples

    A logistic ridge regression model for breast cancer classification

    For this problem we use the functions and parameters from glmnet to solve a practical classification task; the ridge package could be used as well. For ordinary tabular data we can follow the same pattern as Lasso regression.

    Data loading

    This data set is used frequently and appears often in the machine-learning examples shared here; the usual preprocessing steps are applied:

    library(glmnet)
    library(caret)
    BreastCancer <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
    BreastCancer = BreastCancer[, -1]
    dim(BreastCancer)
    ## [1] 568 31
    str(BreastCancer)
    ## 'data.frame': 568 obs. of 31 variables:
    ## $ diagnosis : chr "M" "M" "M" "M" ...
    ## $ radius_mean : num 20.6 19.7 11.4 20.3 12.4 ...
    ## $ texture_mean : num 17.8 21.2 20.4 14.3 15.7 ...
    ## $ perimeter_mean : num 132.9 130 77.6 135.1 82.6 ...
    ## $ area_mean : num 1326 1203 386 1297 477 ...
    ## $ smoothness_mean : num 0.0847 0.1096 0.1425 0.1003 0.1278 ...
    ## $ compactne_mean : num 0.0786 0.1599 0.2839 0.1328 0.17 ...
    ## $ concavity_mean : num 0.0869 0.1974 0.2414 0.198 0.1578 ...
    ## $ concave_points_mean : num 0.0702 0.1279 0.1052 0.1043 0.0809 ...
    ## $ symmetry_mean : num 0.181 0.207 0.26 0.181 0.209 ...
    ## $ fractal_dimension_mean : num 0.0567 0.06 0.0974 0.0588 0.0761 ...
    ## $ radius_se : num 0.543 0.746 0.496 0.757 0.335 ...
    ## $ texture_se : num 0.734 0.787 1.156 0.781 0.89 ...
    ## $ perimeter_se : num 3.4 4.58 3.44 5.44 2.22 ...
    ## $ area_se : num 74.1 94 27.2 94.4 27.2 ...
    ## $ smoothness_se : num 0.00522 0.00615 0.00911 0.01149 0.00751 ...
    ## $ compactne_se : num 0.0131 0.0401 0.0746 0.0246 0.0335 ...
    ## $ concavity_se : num 0.0186 0.0383 0.0566 0.0569 0.0367 ...
    ## $ concave_points_se : num 0.0134 0.0206 0.0187 0.0188 0.0114 ...
    ## $ symmetry_se : num 0.0139 0.0225 0.0596 0.0176 0.0216 ...
    ## $ fractal_dimension_se : num 0.00353 0.00457 0.00921 0.00511 0.00508 ...
    ## $ radius_worst : num 25 23.6 14.9 22.5 15.5 ...
    ## $ texture_worst : num 23.4 25.5 26.5 16.7 23.8 ...
    ## $ perimeter_worst : num 158.8 152.5 98.9 152.2 103.4 ...
    ## $ area_worst : num 1956 1709 568 1575 742 ...
    ## $ smoothness_worst : num 0.124 0.144 0.21 0.137 0.179 ...
    ## $ compactne_worst : num 0.187 0.424 0.866 0.205 0.525 ...
    ## $ concavity_worst : num 0.242 0.45 0.687 0.4 0.535 ...
    ## $ concave_points_worst : num 0.186 0.243 0.258 0.163 0.174 ...
    ## $ symmetry_worst : num 0.275 0.361 0.664 0.236 0.399 ...
    ## $ fractal_dimension_worst: num 0.089 0.0876 0.173 0.0768 0.1244 ...
    table(BreastCancer$diagnosis)
    ##
    ## B M
    ## 357 211
    # Remove near-zero-variance variables
    zerovar = nearZeroVar(BreastCancer[, -1])
    zerovar
    ## integer(0)
    # First remove highly correlated variables
    descrCorr = cor(BreastCancer[, -1])
    descrCorr[1:5, 1:5]
    ## radius_mean texture_mean perimeter_mean area_mean
    ## radius_mean 1.0000000 0.32938305 0.9978764 0.9873442
    ## texture_mean 0.3293830 1.00000000 0.3359176 0.3261929
    ## perimeter_mean 0.9978764 0.33591759 1.0000000 0.9865482
    ## area_mean 0.9873442 0.32619289 0.9865482 1.0000000
    ## smoothness_mean 0.1680940 -0.01776898 0.2045046 0.1748380
    ## smoothness_mean
    ## radius_mean 0.16809398
    ## texture_mean -0.01776898
    ## perimeter_mean 0.20450464
    ## area_mean 0.17483805
    ## smoothness_mean 1.00000000
    highCorr = findCorrelation(descrCorr, 0.9)
    BreastCancer = BreastCancer[, -(highCorr + 1)]
    # Then address multicollinearity; none is present in this example
    comboInfo = findLinearCombos(BreastCancer[, -1])
    Process = preProcess(BreastCancer)
    BreastCancer = predict(Process, BreastCancer)
    Data splitting

    library(tidyverse)
    library(sampling)
    set.seed(123)
    # Draw 70% of the data from each stratum
    train_id <- strata(BreastCancer, "diagnosis", size = rev(round(table(BreastCancer$diagnosis) * 0.7)))$ID_unit
    # Training data
    trainData <- BreastCancer[train_id, ]
    # Test data
    testData <- BreastCancer[-train_id, ]
    # Check the class proportions in the training, test, and full data
    prop.table(table(trainData$diagnosis))
    ##
    ## B M
    ## 0.6281407 0.3718593
    prop.table(table(testData$diagnosis))
    ##
    ## B M
    ## 0.6294118 0.3705882
    prop.table(table(BreastCancer$diagnosis))
    ##
    ## B M
    ## 0.6285211 0.3714789
    Visualizing important variables

    # 4. How to visualize the importance of variables using featurePlot()
    featurePlot(x = trainData[, 2:21],
                y = as.factor(trainData$diagnosis),
                plot = "box",  # For classification: box, strip, density, pairs or ellipse. For regression: pairs or scatter
                strip = strip.custom(par.strip.text = list(cex = .7)),
                scales = list(x = list(relation = "free"),
                              y = list(relation = "free")))

    [Figure: boxplots of each feature by diagnosis class]

    Building the model inputs

    x = as.matrix(trainData[, -1])
    y = ifelse(trainData$diagnosis == "M", 1, 0)
    Fitting the logistic ridge model

    ridge <- glmnet(x, y, family = "binomial", alpha = 0)
    plot(ridge, xvar = "lambda", label = TRUE)

    [Figure: ridge coefficient paths versus log(lambda)]

    Cross-validation

    ridge_model <- cv.glmnet(x, y, alpha = 0, nfolds = 3)
    plot(ridge_model)

    [Figure: cross-validation curve from cv.glmnet]

    plot(ridge_model$glmnet.fit, "lambda", label = T)

    [Figure: coefficient paths of ridge_model$glmnet.fit versus lambda]

    Finding the value of lambda that minimizes the model error

    The output of cv.glmnet() contains a lambda.min value (here ridge_model$lambda.min = 0.15), the value of lambda at which the cross-validated model error is smallest; it can be used to build the final ridge regression model. Below, this parameter is used to fit a ridge model on the full training data set.
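    The same idea, selecting the ridge penalty by cross-validation and then refitting, can be sketched outside R with scikit-learn's RidgeCV. This is a hypothetical Python analogue of cv.glmnet(..., alpha = 0) on simulated SNP-like data, not part of the original R workflow; the data, effect sizes, and penalty grid are all invented for illustration.

    ```python
    import numpy as np
    from sklearn.linear_model import RidgeCV

    rng = np.random.default_rng(123)
    X = rng.integers(0, 3, size=(200, 50)).astype(float)  # SNP-like genotypes coded 0/1/2
    beta = np.zeros(50)
    beta[:5] = 1.0                                        # only the first 5 variants are causal
    y = X @ beta + rng.normal(size=200)

    # Choose the ridge penalty (lambda; called alpha in scikit-learn) by 3-fold CV
    model = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=3).fit(X, y)
    lam = model.alpha_             # analogous to ridge_model$lambda.min
    print(lam, model.score(X, y))  # selected penalty and in-sample R^2
    ```

    As with glmnet, the selected penalty is then used for the final fit; RidgeCV refits on the full data automatically once the best alpha is found.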

    ridge_min <- ridge_model$lambda.min
    ## Fit the ridge model using ridge_min
    ridge_best <- glmnet(x, y, alpha = 0, lambda = ridge_min)
    coef(ridge_best)
    ## 21 x 1 sparse Matrix of class "dgCMatrix"
    ## s0
    ## (Intercept) 0.3729332067
    ## area_mean 0.0885358564
    ## smoothness_mean 0.0052045287
    ## compactne_mean -0.0003435624
    ## symmetry_mean 0.0075012310
    ## fractal_dimension_mean -0.0839607648
    ## radius_se 0.0636875678
    ## texture_se 0.0027998059
    ## smoothness_se 0.0388690882
    ## compactne_se -0.0551225591
    ## concavity_se -0.0524564964
    ## concave_points_se 0.0598000796
    ## symmetry_se -0.0177510728
    ## fractal_dimension_se 0.0119540616
    ## texture_worst 0.0557146338
    ## smoothness_worst 0.0310525197
    ## compactne_worst -0.0004504186
    ## concavity_worst 0.0892093379
    ## concave_points_worst 0.1296407274
    ## symmetry_worst 0.0614660535
    ## fractal_dimension_worst 0.0430986166
    Validation on the test set
    ridge.coef <- predict(ridge_model, s = 0.05, type = "coefficients")
    ridge.coef
    ## 21 x 1 sparse Matrix of class "dgCMatrix"
    ## s1
    ## (Intercept) 0.3729040858
    ## area_mean 0.0894265533
    ## smoothness_mean 0.0055832621
    ## compactne_mean 0.0009547588
    ## symmetry_mean 0.0076406106
    ## fractal_dimension_mean -0.0794539760
    ## radius_se 0.0632058362
    ## texture_se 0.0022559684
    ## smoothness_se 0.0356625085
    ## compactne_se -0.0512845944
    ## concavity_se -0.0465064056
    ## concave_points_se 0.0570207458
    ## symmetry_se -0.0175198162
    ## fractal_dimension_se 0.0076460181
    ## texture_worst 0.0559848075
    ## smoothness_worst 0.0332178942
    ## compactne_worst 0.0054483603
    ## concavity_worst 0.0835264079
    ## concave_points_worst 0.1249131201
    ## symmetry_worst 0.0592973698
    ## fractal_dimension_worst 0.0381440680
    ridge.y <- predict(ridge_model, newx = as.matrix(testData[, -1]), type = "response", s = 0.05)
    Plotting the ROC curve
    library(InformationValue)
    actuals <- ifelse(testData$diagnosis == "M", 1, 0)
    misClassError(actuals, ridge.y)
    ## [1] 0.0588
    plotROC(actuals, ridge.y)

    [Figure: ROC curve on the test set]

    A ridge regression model from genotype and phenotype data

    For data of this kind we use the dedicated ridge package, which is very convenient and includes built-in functions for prediction from genotype files, making it easy to work with genotype and phenotype data.
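    Before turning to the ridge package itself, the underlying model, an L2-penalized logistic regression of a binary phenotype on SNP dosages, can be sketched in Python with scikit-learn. This is a hypothetical illustration on simulated data: the causal SNPs, effect sizes, and sample size are invented and are not taken from the GenBin data set.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, p = 500, 14
    X = rng.integers(0, 3, size=(n, p)).astype(float)  # 14 SNP-like predictors coded 0/1/2
    logit = X[:, 0] - X[:, 7]                          # two invented causal SNPs, opposite effects
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

    # L2 penalty = ridge shrinkage; C is the inverse of the ridge parameter lambda
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
    prob = clf.predict_proba(X)[:, 1]  # fitted probabilities, like predict(fit, type = "response")
    ```

    Shrinkage pulls all 14 coefficients toward zero, but the truly causal predictors should still stand out; the scaled-coefficient significance tests reported by logisticRidge below formalize that idea for genetic data.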

    Building the logistic ridge model
    data(GenBin)
    head(GenBin)
    ## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
    ## [1,] 0 1 0 0 0 0 0 0 1 0 1 0 0
    ## [2,] 0 0 0 0 0 0 0 1 0 0 0 0 0
    ## [3,] 1 0 0 0 0 0 0 2 0 1 0 0 1
    ## [4,] 0 0 0 0 0 0 1 0 0 0 0 0 0
    ## [5,] 0 0 0 0 0 0 0 1 0 0 0 0 0
    ## [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0
    ## SNP13 SNP14
    ## [1,] 0 1
    ## [2,] 1 0
    ## [3,] 1 0
    ## [4,] 0 0
    ## [5,] 0 0
    ## [6,] 0 0
    table(GenBin[, 1])
    ##
    ## 0 1
    ## 250 250
    beta_logisticRidge <- logisticRidge(Phenotypes ~ ., data = as.data.frame(GenBin))
    summary(beta_logisticRidge)
    ##
    ## Call:
    ## logisticRidge(formula = Phenotypes ~ ., data = as.data.frame(GenBin))
    ##
    ##
    ## Coefficients:
    ## Estimate Scaled estimate Std. Error (scaled) t value (scaled)
    ## (Intercept) 0.002782 NA NA NA
    ## SNP1 0.166067 1.914369 0.666082 2.874
    ## SNP2 0.110745 0.269635 0.963936 0.280
    ## SNP3 -0.284802 -0.491809 0.999027 -0.492
    ## SNP4 -1.007615 -1.422131 0.990998 -1.435
    ## SNP5 0.911082 0.910170 0.986458 0.923
    ## SNP6 -0.026877 -0.355500 0.959424 -0.371
    ## SNP7 -0.068540 -1.022005 0.922686 -1.108
    ## SNP8 0.149821 1.729257 0.667675 2.590
    ## SNP9 -0.159520 -0.795170 0.948859 -0.838
    ## SNP10 -0.207470 -1.214400 0.893062 -1.360
    ## SNP11 0.027842 0.067788 0.999575 0.068
    ## SNP12 0.090563 0.651787 0.958376 0.680
    ## SNP13 -0.051310 -0.705257 0.947386 -0.744
    ## SNP14 -0.196030 -0.990720 0.904846 -1.095
    ## Pr(>|t|)
    ## (Intercept) NA
    ## SNP1 0.00405 **
    ## SNP2 0.77969
    ## SNP3 0.62252
    ## SNP4 0.15127
    ## SNP5 0.35618
    ## SNP6 0.71098
    ## SNP7 0.26802
    ## SNP8 0.00960 **
    ## SNP9 0.40202
    ## SNP10 0.17389
    ## SNP11 0.94593
    ## SNP12 0.49644
    ## SNP13 0.45662
    ## SNP14 0.27356
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Ridge parameter: 0.1244006, chosen automatically, computed using 3 PCs
    ##
    ## Degrees of freedom: model 7.02 , variance 4.066
    pvalsMod <- pvals(beta_logisticRidge)
    print(pvalsMod)
    ##
    ## lambda 0.124401 chosen automatically using 3 PCs
    ##
    ## Estimate (scaled) Std. Error (scaled) t value (scaled) Pr(>|t|)
    ## SNP1 1.91437 0.66608 2.874 0.00405 **
    ## SNP2 0.26964 0.96394 0.280 0.77969
    ## SNP3 -0.49181 0.99903 -0.492 0.62252
    ## SNP4 -1.42213 0.99100 -1.435 0.15127
    ## SNP5 0.91017 0.98646 0.923 0.35618
    ## SNP6 -0.35550 0.95942 -0.371 0.71098
    ## SNP7 -1.02201 0.92269 -1.108 0.26802
    ## SNP8 1.72926 0.66767 2.590 0.00960 **
    ## SNP9 -0.79517 0.94886 -0.838 0.40202
    ## SNP10 -1.21440 0.89306 -1.360 0.17389
    ## SNP11 0.06779 0.99958 0.068 0.94593
    ## SNP12 0.65179 0.95838 0.680 0.49644
    ## SNP13 -0.70526 0.94739 -0.744 0.45662
    ## SNP14 -0.99072 0.90485 -1.095 0.27356
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    plot(pvalsMod)

    [Figure: plot of p-values from pvals(beta_logisticRidge)]

    pred_phen <- predict(beta_logisticRidge, type = "response")
    res <- cbind(GenBin, pred_phen)
    head(res)
    ## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
    ## 1 0 1 0 0 0 0 0 0 1 0 1 0 0
    ## 2 0 0 0 0 0 0 0 1 0 0 0 0 0
    ## 3 1 0 0 0 0 0 0 2 0 1 0 0 1
    ## 4 0 0 0 0 0 0 1 0 0 0 0 0 0
    ## 5 0 0 0 0 0 0 0 1 0 0 0 0 0
    ## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
    ## SNP13 SNP14 pred_phen
    ## 1 0 1 0.4788050
    ## 2 1 0 0.4707666
    ## 3 1 0 0.4367005
    ## 4 0 0 0.4939766
    ## 5 0 0 0.4835665
    ## 6 0 0 0.5006956
    Prediction from genotype files
    genotypesfile <- system.file("extdata", "GenBin_genotypes.txt", package = "ridge")
    phenotypesfile <- system.file("extdata", "GenBin_phenotypes.txt", package = "ridge")
    beta_logisticRidgeGenotypes <- logisticRidgeGenotypes(genotypesfilename = genotypesfile, phenotypesfilename = phenotypesfile)
    beta_logisticRidgeGenotypes
    ## B
    ## Intercept 0.002782
    ## SNP1 0.166067
    ## SNP2 0.110745
    ## SNP3 -0.284802
    ## SNP4 -1.007615
    ## SNP5 0.911082
    ## SNP6 -0.026877
    ## SNP7 -0.068540
    ## SNP8 0.149821
    ## SNP9 -0.159520
    ## SNP10 -0.207470
    ## SNP11 0.027842
    ## SNP12 0.090563
    ## SNP13 -0.051310
    ## SNP14 -0.196030
    ## compare to output of logisticRidge
    results <- as.data.frame(cbind(round(coef(beta_logisticRidge), 6), beta_logisticRidgeGenotypes))
    results
    ## round(coef(beta_logisticRidge), 6) B
    ## (Intercept) 0.002782 0.002782
    ## SNP1 0.166067 0.166067
    ## SNP2 0.110745 0.110745
    ## SNP3 -0.284802 -0.284802
    ## SNP4 -1.007615 -1.007615
    ## SNP5 0.911082 0.911082
    ## SNP6 -0.026877 -0.026877
    ## SNP7 -0.068540 -0.068540
    ## SNP8 0.149821 0.149821
    ## SNP9 -0.159520 -0.159520
    ## SNP10 -0.207470 -0.207470
    ## SNP11 0.027842 0.027842
    ## SNP12 0.090563 0.090563
    ## SNP13 -0.051310 -0.051310
    ## SNP14 -0.196030 -0.196030
    Building the linear ridge model
    data(GenCont)
    head(GenCont)
    ## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
    ## [1,] 1.4356316 1 1 1 0 0 1 0 0 0 0 1 0
    ## [2,] 2.9226960 1 0 0 0 1 0 0 1 0 0 1 0
    ## [3,] 0.5669319 0 0 0 0 0 0 0 0 0 0 2 0
    ## [4,] 4.8515051 1 0 0 0 1 0 0 0 0 0 1 0
    ## [5,] 0.1525582 1 1 1 0 0 1 0 0 0 0 1 0
    ## [6,] 2.9522701 1 0 0 0 1 0 0 0 0 0 0 0
    beta_linearRidge <- linearRidge(Phenotypes ~ ., data = as.data.frame(GenCont))
    summary(beta_linearRidge)
    ##
    ## Call:
    ## linearRidge(formula = Phenotypes ~ ., data = as.data.frame(GenCont))
    ##
    ##
    ## Coefficients:
    ## Estimate Scaled estimate Std. Error (scaled) t value (scaled)
    ## (Intercept) 1.533386 NA NA NA
    ## SNP1 0.277296 4.045409 0.266120 15.201
    ## SNP2 -0.110458 -1.256154 0.216332 5.807
    ## SNP3 -0.110458 -1.256154 0.216332 5.807
    ## SNP4 0.005230 0.011635 0.371693 0.031
    ## SNP5 0.531173 6.323006 0.315368 20.050
    ## SNP6 -0.119164 -1.373227 0.223047 6.157
    ## SNP7 0.113844 0.113730 0.372181 0.306
    ## SNP8 -0.099149 -1.028581 0.355807 2.891
    ## SNP9 -0.008321 -0.008312 0.372386 0.022
    ## SNP10 0.058562 0.101128 0.371567 0.272
    ## SNP11 -0.096526 -1.495699 0.329250 4.543
    ## SNP12 -0.334279 -0.333945 0.372248 0.897
    ## Pr(>|t|)
    ## (Intercept) NA
    ## SNP1 < 2e-16 ***
    ## SNP2 6.38e-09 ***
    ## SNP3 6.38e-09 ***
    ## SNP4 0.97503
    ## SNP5 < 2e-16 ***
    ## SNP6 7.43e-10 ***
    ## SNP7 0.75993
    ## SNP8 0.00384 **
    ## SNP9 0.98219
    ## SNP10 0.78549
    ## SNP11 5.55e-06 ***
    ## SNP12 0.36966
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Ridge parameter: 2.206848, chosen automatically, computed using 1 PCs
    ##
    ## Degrees of freedom: model 3.121 , variance 1.205 , residual 5.036
    pvalsMod <- pvals(beta_linearRidge)
    print(pvalsMod)
    ##
    ## lambda 2.206848 chosen automatically using 1 PCs
    ##
    ## Estimate (scaled) Std. Error (scaled) t value (scaled) Pr(>|t|)
    ## SNP1 4.045409 0.266120 15.201 < 2e-16 ***
    ## SNP2 -1.256154 0.216332 5.807 6.38e-09 ***
    ## SNP3 -1.256154 0.216332 5.807 6.38e-09 ***
    ## SNP4 0.011635 0.371693 0.031 0.97503
    ## SNP5 6.323006 0.315368 20.050 < 2e-16 ***
    ## SNP6 -1.373227 0.223047 6.157 7.43e-10 ***
    ## SNP7 0.113730 0.372181 0.306 0.75993
    ## SNP8 -1.028581 0.355807 2.891 0.00384 **
    ## SNP9 -0.008312 0.372386 0.022 0.98219
    ## SNP10 0.101128 0.371567 0.272 0.78549
    ## SNP11 -1.495699 0.329250 4.543 5.55e-06 ***
    ## SNP12 -0.333945 0.372248 0.897 0.36966
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    plot(pvalsMod)

    [Figure: plot of p-values from pvals(beta_linearRidge)]

    pred_phen <- predict(beta_linearRidge)
    res <- cbind(GenCont, pred_phen)
    head(res)
    ## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
    ## 1 1.4356316 1 1 1 0 0 1 0 0 0 0 1 0
    ## 2 2.9226960 1 0 0 0 1 0 0 1 0 0 1 0
    ## 3 0.5669319 0 0 0 0 0 0 0 0 0 0 2 0
    ## 4 4.8515051 1 0 0 0 1 0 0 0 0 0 1 0
    ## 5 0.1525582 1 1 1 0 0 1 0 0 0 0 1 0
    ## 6 2.9522701 1 0 0 0 1 0 0 0 0 0 0 0
    ## pred_phen
    ## 1 1.374077
    ## 2 2.146179
    ## 3 1.340333
    ## 4 2.245328
    ## 5 1.374077
    ## 6 2.341855
    Prediction from genotype files
    genotypesfile <- system.file("extdata", "GenCont_genotypes.txt", package = "ridge")
    phenotypesfile <- system.file("extdata", "GenCont_phenotypes.txt", package = "ridge")
    betafile <- tempfile(pattern = "beta", fileext = ".dat")
    beta_linearRidgeGenotypes <- linearRidgeGenotypes(genotypesfilename = genotypesfile, phenotypesfilename = phenotypesfile, betafilename = betafile)
    pred_phen_geno <- linearRidgeGenotypesPredict(genotypesfilename = genotypesfile, betafilename = betafile)
    ## compare to output of linearRidge
    results <- cbind(pred_phen_geno, pred_phen)
    head(results)
    ## PredictedPhenotypes pred_phen
    ## 1 1.374076 1.374077
    ## 2 2.146180 2.146179
    ## 3 1.340334 1.340333
    ## 4 2.245329 2.245328
    ## 5 1.374076 1.374077
    ## 6 2.341855 2.341855
    Reference

    1. Cule, E., et al. (2011). Significance testing in ridge regression for genetic data. BMC Bioinformatics, 12:372.

    2. Cule, E., and De Iorio, M. (2012). A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686v1.

    3. Cule, E., and De Iorio, M. (2013). Ridge regression in prediction problems: automatic choice of the ridge parameter. Genetic Epidemiology, 37(7):704-714.

  • Original article: https://blog.csdn.net/weixin_41368414/article/details/136266774