There is no shortage of papers that build prognostic models with machine learning, and the field keeps getting more competitive: "101 combinations of 10 algorithms" pops up at every turn. This series will introduce, one by one, the machine-learning methods commonly used for variable selection and model building in prognostic modeling.
This post covers the Elastic Net: a statistical method for regression analysis that combines ridge regression (Ridge Regression) and lasso regression (Lasso Regression). The key distinctions are:
(1) Applying an L2 penalty to the loss function of linear regression gives Ridge regression.
(2) Applying an L1 penalty to the loss function of linear regression gives Lasso regression.
(3) Applying both the L1 and L2 penalties to the loss function of linear regression gives Elastic Net regression.
From a coding perspective, all three can be fit with the glmnet package; the only difference is the choice of the alpha parameter. In other words, fitting an Elastic Net mainly comes down to finding the optimal alpha in (0, 1).
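The relationship among the three penalties can be sketched with a toy function (an illustration only, not glmnet's internals: the real objective also multiplies the penalty by lambda and standardizes the predictors):

```r
# Elastic-net penalty as parameterized by glmnet's alpha (sketch):
#   P(beta) = (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1
enet_penalty <- function(beta, alpha) {
  (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))
}

beta <- c(0.5, -1, 2)
enet_penalty(beta, alpha = 0)    # alpha = 0: pure ridge term, 2.625
enet_penalty(beta, alpha = 1)    # alpha = 1: pure lasso term, 3.5
enet_penalty(beta, alpha = 0.5)  # 0 < alpha < 1: elastic net, 3.0625
```

So `glmnet(..., alpha = 0)` is ridge, `alpha = 1` is lasso, and anything strictly in between is an Elastic Net.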
This post covers: (1) data splitting, (2) two ways to choose the optimal alpha, (3) selecting variables and building a Cox model, and (4) predicting prognosis directly. Read to the end.
I Data input and processing
We keep using the SKCM.uni-COX2.RData data produced in the earlier Lasso post (genes pre-filtered for significant univariate prognostic association). The upcoming machine-learning posts will all use this dataset; reply “机器学习” to the official account to get it.
# load R packages
library(tidyverse)
library(survival)
library(survminer)
library(glmnet)
library(Hmisc)      # for rcorr.cens, used below to compute the C-index
load("SKCM.uni-COX2.RData")
module_expr.cox2 <- module_expr.cox %>% select(- "_PATIENT") %>%
column_to_rownames("sample")
module_expr.cox2[1:6,1:6]
Process the data into the form above: follow-up time, survival status, and gene expression.

1. Dataset splitting

Normally you would build the model on TCGA data and validate it on GEO data. As a demonstration only, here the TCGA data are split 7:3 into a training set and a validation set (more splitting strategies will be covered later). Note that a seed must also be set (which the previous post forgot to do).
# 7:3 split
set.seed(1234)
ind <- sample(nrow(module_expr.cox2),nrow(module_expr.cox2) * 0.7 )
train <- module_expr.cox2[ind,]
test <- module_expr.cox2[-ind,]
## make sure the training and validation sets share the same genes
gene_com <- intersect(colnames(train) ,colnames(test))
training <- train %>% select(all_of(gene_com))
testing <- test %>% select(all_of(gene_com))
Note: the training and validation sets must contain the same genes; otherwise validation may fail.
II Building the Enet model
As noted above, the core step of an Enet model is choosing the optimal alpha. Two approaches are shown here.
1. Looping to pick the optimal alpha
Loop over candidate alphas, fitting on the training set and scoring on the testing set. The advantage is that you get best_alpha directly for downstream use, which is convenient for wrapping into a function.
# set initial parameters
cindex_fin <- c()
j=1
best_cindex = 0
best_alpha = 0.1
set.seed(1234)
x1 <- as.matrix(training[,!(colnames(training) %in% c("OS.time","OS"))])
x2 <- as.matrix(Surv(training$OS.time,training$OS))
for (alpha in seq(0.1,0.9,0.1)) {
fit = cv.glmnet(x1,x2,
family = "cox",
alpha=alpha,
nfolds = 10)
# predict on the test set
pred_cox = predict(fit,type='link',
newx= as.matrix(testing[,!(colnames(testing) %in% c("OS.time","OS"))]),
s = fit$lambda.min)
pred_cox <- as.numeric(pred_cox)
# compute the C-index with rcorr.cens; it treats higher scores as better outcomes, hence 1 - C
cindex_cox = 1-rcorr.cens(pred_cox,Surv(testing$OS.time, testing$OS))[[1]]
cindex_fin[j] = cindex_cox
if (cindex_fin[j] > best_cindex) {
best_cindex = cindex_fin[j]
best_alpha = alpha
}
print(paste("alpha_now is " ,alpha ," and the value is " ,cindex_fin[j] ,sep = ""))
j = j + 1
}
print(paste("The best alpha is " ,best_alpha ," and the Cindex is " ,best_cindex ,sep = ""))
[1] "alpha_now is 0.1 and the value is 0.642761007498558"
[1] "alpha_now is 0.2 and the value is 0.644491443953086"
[1] "alpha_now is 0.3 and the value is 0.644106902518746"
[1] "alpha_now is 0.4 and the value is 0.643145548932897"
[1] "alpha_now is 0.5 and the value is 0.643722361084407"
[1] "alpha_now is 0.6 and the value is 0.642376466064218"
[1] "alpha_now is 0.7 and the value is 0.643914631801577"
[1] "alpha_now is 0.8 and the value is 0.643337819650067"
[1] "alpha_now is 0.9 and the value is 0.642568736781388"
[1] "The best alpha is 0.2 and the Cindex is 0.644491443953086"
Note: here the C-index is computed via rcorr.cens.
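For intuition, here is a hand-rolled pair-counting version of Harrell's C-index on toy data (a sketch, not the Hmisc implementation, and it ignores tied-time subtleties): among comparable pairs, it is the fraction where the higher risk score goes with the earlier event. Because rcorr.cens assumes higher predictions mean better outcomes, the code above takes 1 - rcorr.cens(...)[[1]] when scoring a risk score.

```r
# Toy C-index by pair counting: higher risk should mean shorter survival.
# A pair is comparable if the patient with the shorter time had an event.
cindex_pairs <- function(time, status, risk) {
  conc <- comp <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    a <- if (time[i] < time[j]) i else j   # index with the shorter time
    b <- if (time[i] < time[j]) j else i
    if (time[a] == time[b] || status[a] == 0) next  # not comparable
    comp <- comp + 1
    if (risk[a] > risk[b]) conc <- conc + 1          # concordant pair
    else if (risk[a] == risk[b]) conc <- conc + 0.5  # tied score
  }
  conc / comp
}

time   <- c(5, 10, 15, 20)
status <- c(1, 1, 0, 1)
risk   <- c(2.0, 1.5, 1.0, 0.5)
cindex_pairs(time, status, risk)  # 1: risk is perfectly anti-ordered with time
```

A value of 0.5 means the score is no better than chance; published prognostic signatures typically report values in the 0.6 to 0.8 range.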
2. Keeping the results for every alpha and judging them together (recommended)
The second approach outputs the results for every alpha in the loop, then chooses the optimal alpha by weighing the training-set and validation-set results together.
seed = 1234
result <- data.frame()
# list of datasets to score; the first two columns must be the survival columns
val_dd_list <- list(Train = training, Test = testing)
for (alpha in seq(0.1,0.9,0.1)) {
set.seed(seed)
fit = cv.glmnet(x1, x2,family = "cox",alpha=alpha,nfolds = 10)
rs <- lapply(val_dd_list,function(x){cbind(x[,1:2],
RS=as.numeric(predict(fit,type='link',
newx=as.matrix(x[,-c(1,2)]),s=fit$lambda.min)))})
cc <- data.frame(Cindex=sapply(rs,function(x){as.numeric(summary(coxph(Surv(OS.time,OS)~RS,x))$concordance[1])}))%>%
rownames_to_column('ID')
cc$Model <- paste0('Enet','[α=',alpha,']')
result <- rbind(result,cc)
}
result
The results show that the best alpha on the training set is 0.1, while on the validation set it is 0.2.
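When the two datasets disagree, one common convention is to rank alphas by their mean C-index across all datasets. A sketch on a mock table with the same columns as the loop output above (the numbers are invented):

```r
# Mock 'result' table in the shape produced by the loop above (values invented)
result_mock <- data.frame(
  ID     = rep(c("Train", "Test"), 3),
  Cindex = c(0.660, 0.641, 0.655, 0.644, 0.650, 0.643),
  Model  = rep(c("Enet[a=0.1]", "Enet[a=0.2]", "Enet[a=0.3]"), each = 2)
)

# average the C-index per alpha across datasets, then keep the best
mean_ci <- aggregate(Cindex ~ Model, data = result_mock, FUN = mean)
best_model <- mean_ci$Model[which.max(mean_ci$Cindex)]
best_model  # "Enet[a=0.1]" for this mock table
```

Weighting the validation set more heavily than the training set is an equally defensible choice; the point is to fix the rule before looking at the numbers.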
3. Building the Enet model
Here alpha = 0.2 is taken as the optimum.
best_alpha = 0.2
fit_F = cv.glmnet(x1,x2 ,
family = "cox",
alpha=best_alpha,
nfolds = 10)
plot(fit_F)
III Enet results: prognostic validation
Two ways to validate the prognostic model are shown: first, select variables and then build a Cox prognostic model; second, predict directly and check the prognostic performance.
1. Selecting variables and building a Cox model
# lambda.min is used here; lambda.1se can also be tried
coefficient <- coef(fit_F, s = "lambda.min")
# variables (genes) with non-zero coefficients are the ones retained
Active.index <- which(as.numeric(coefficient) != 0)
Active.coefficient <- as.numeric(coefficient)[Active.index]
sig_gene_mult_cox <- rownames(coefficient)[Active.index]
# inspect which genes were selected
sig_gene_mult_cox
# [1] "CYTL1" "CXCL10" "KIT" "CCDC8" "GBP4" "ABCC2" "IFITM1" "CRABP2" "BOK"
#[10] "ZNF667_AS1" "TTYH2" "CXCL11" "PYROXD2" "OLFM2" "HSPA7" "SAMD9L" "CDC42EP1" "MT2A"
#[19] "B2M" "NEAT1" "TRIM22" "HPN_AS1" "PIR" "RAP1GAP" "DYNLT3" "MIR4664" "KIAA0040"
#[28] "HPDL" "TPPP" "UBA7"
# extract the sig_gene_mult_cox genes and build the prognostic model
training_cox <- training %>%
dplyr::select(OS, OS.time, all_of(sig_gene_mult_cox))
# fit the Cox model
multiCox <- coxph(Surv(OS.time, OS) ~ ., data = training_cox)
# compute risk scores with predict
riskScore=predict(multiCox,type="risk",newdata=training_cox)
riskScore<-as.data.frame(riskScore)
riskScore$sample <- rownames(riskScore)
head(riskScore,2)
# riskScore sample
#TCGA-D3-A5GO-06A 0.1895016 TCGA-D3-A5GO-06A
#TCGA-D3-A1Q8-06A 0.9225141 TCGA-D3-A1Q8-06A
###### median split of riskScore and KM curve ##########
riskScore_cli <- training_cox %>%
rownames_to_column("sample") %>%
inner_join(riskScore)
# split into high- and low-risk groups at the median
riskScore_cli$riskScore2 <- ifelse(riskScore_cli$riskScore > median(riskScore_cli$riskScore),
"High","Low")
# KM analysis
fit2 <- survfit(Surv(OS.time, as.numeric(OS)) ~ riskScore2, data=riskScore_cli)
p2 <- ggsurvplot(fit2, data = riskScore_cli,
pval = T,
risk.table = T,
surv.median.line = "hv", # add median survival lines
palette=c("red", "blue"), # line colors
legend.labs=c("High risk","Low risk"), # legend labels
legend.title="RiskScore",
title="Overall survival", # plot title
ylab="Cumulative survival (percentage)",xlab = " Time (Days)", # axis labels
censor.shape = 124,censor.size = 2, # censor mark shape and size
conf.int = FALSE, # no confidence band
break.x.by = 720 # x-axis tick interval
)
p2
2. Predicting prognosis directly
Predict directly without variable selection, split the predictions at the median, draw the KM curve, and compare with the variable-selection result.
pred_cox = predict(fit_F,type='link',
newx = as.matrix(training[,c(3:ncol(training))]),
s = fit_F$lambda.min)
training$Enet_pred <- as.numeric(pred_cox)
training_surv <- coxph(Surv(OS.time, OS) ~ Enet_pred,data = training)
#C-index
summary(training_surv)$concordance
# C se(C)
#0.70797834 0.02315465
# KM curve, training set
training$Enet_score <- ifelse(training$Enet_pred > median(training$Enet_pred),
"High","Low")
fit3 <- survfit(Surv(OS.time, as.numeric(OS)) ~ Enet_score, data=training)
p3 <- ggsurvplot(fit3, data = training,
pval = T,
risk.table = T,
surv.median.line = "hv", # add median survival lines
palette=c("red", "blue"), # line colors
legend.labs=c("High risk","Low risk"), # legend labels
legend.title="RiskScore",
title="Overall survival", # plot title
ylab="Cumulative survival (percentage)",
xlab = " Time (Days)", # axis labels
censor.shape = 124,censor.size = 2, # censor mark shape and size
conf.int = FALSE, # no confidence band
break.x.by = 720 # x-axis tick interval
)
p3
The two approaches give slightly different results; choose whichever you prefer.
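One way to quantify how similar the two approaches are (a sketch on toy vectors; with the real data you would pass riskScore$riskScore and training$Enet_pred) is the Spearman correlation of the two scores, since the median-split KM grouping depends only on ranks:

```r
# Toy comparison of two risk scores: only the ranking matters for the
# median split, so Spearman correlation is a natural agreement measure.
score_cox  <- c(0.19, 0.92, 0.45, 1.30, 0.70)   # e.g. coxph risk scores
score_enet <- c(-1.2, 0.8, -0.1, 1.1, 0.3)      # e.g. glmnet linear predictor

cor(score_cox, score_enet, method = "spearman")  # 1: identical ranking here

# identical median-split groups imply identical KM curves
g1 <- ifelse(score_cox  > median(score_cox),  "High", "Low")
g2 <- ifelse(score_enet > median(score_enet), "High", "Low")
all(g1 == g2)  # TRUE
```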
Many more downstream analyses are possible; they are not repeated here, so refer to the earlier posts.
References:
[1] A machine learning framework develops a DNA replication stress model for predicting clinical outcomes and therapeutic vulnerability in primary prostate cancer
[2] Machine learning-based integration develops an immune-derived lncRNA signature for improving outcomes in colorectal cancer
◆ ◆ ◆ ◆ ◆