随机森林模型在生存分析中的应用【摘要】目的:本文探讨随机森林方法用于高维度、强相关、小样本的生存资料分析时,可以起到变量筛选的作用。方法:以乳腺癌数据集构建乳腺癌转移风险评估模型为实例进行实证分析,使用随机森林模型进行变量选择,然后拟合cox回归模型。结果:随机森林模型通过对变量的选择,有效的解决数据维度高且强相关的情况,得到了较高的AUC值。一、数据说明该乳腺癌数据集来自于NCBI,有77个观测值以及22286个基因变量。通过筛选选取454个基因变量。将数据随机分为训练集合测试集,其中2/3为训练集,1/3为测试集。绘制K-M曲线图:二、随机森林模型随机森林由许多的决策树组成,因为这些决策树的形成采用了随机的方法,因此也叫做随机决策树。随机森林中的树之间是没有关联的。当测试数据进入随机森林时,其实就是让每一颗决策树进行分类,最后取所有决策树中分类结果最多的那类为最终的结果。因此随机森林是一个包含多个决策树的分类器,并且其输出的类别是由个别树输出的类别的众数而定。使用randomForestSRC包得到的随机森林模型具有以下性质:Numberofdeaths:27Numberoftrees:800Minimumterminalnodesize:3Averageno.ofterminalnodes:14.4275No.ofvariablestriedateachsplit:3Totalno.ofvariables:452Analysis:RSFFamily:survSplittingrule:logrankErrorrate:19.87%发现直接使用随机森林得到的模型,预测误差很大,达到了19.8%,进一步考虑使用随机森林模型进行变量选择,结果如下:our.rf$rfsrc.refit.objSamplesize:52Numberofdeaths:19Numberoftrees:500Minimumterminalnodesize:2Averageno.ofterminalnodes:11.554No.ofvariablestriedateachsplit:3Totalno.ofvariables:9Analysis:RSFFamily:survSplittingrule:logrank*random*Numberofrandomsplitpoints:10Errorrate:11.4%our.rf$topvars[1]213821_s_at219778_at204690_at220788_s_at202202_s_at[6]211603_s_at213055_at219336_s_at37892_at一共选取了9个变量,同时误差只有11.4%接下来,使用这些变量做cox回归,剔除模型中不显著(0.01)的变量,最终参与模型建立的变量共有4个。模型结果如下:exp(coef)exp(-coef)lower.95upper.95`218150_at`1.65410.60460.1108624.6800`200914_x_at`0.99151.00860.340942.8833`220788_s_at`0.26493.77500.059441.1805`201398_s_at`1.74570.57290.331099.2038`201719_s_at`2.47080.40470.938086.5081`202945_at`0.41182.42840.039904.2499`203261_at`3.15020.31740.3364129.4983`203757_s_at`0.78611.27200.616561.0024`205068_s_at`0.10739.31800.022230.5181最后选取六个变量拟合生存模型,绘制生存曲线如下:下面绘制ROC曲线,分别在训练集和测试集上绘制ROC曲线,结果如下:训练集:测试集:由于测试集上的样本过少,所以得到的AUC值波动大,考虑使用bootstrap多次计算训练集上的AUC值并求平均来测试模型的效果:AUCat1year:0.8039456AUCat3year:0.6956907AUCat5year:0.7024846由此可以看到,随机森林通过删除贡献较低的变量,完成变量选择的工作,在测试集上具有较高的AUC值,但是比lasso-cox模型得到的AUC略低。附录:load(~/R/brea.rda)library(survival)set.seed(10)i-sample(1:77,52)train-dat[i,]test-dat[-i,]library(randomForestSRC)disease.rf-rfsrc(Surv(time,status)~.,data=train,ntree=800,mtry=3,nodesize=3,splitrule=logrank)disease.rfour.rf-var.select(object=disease.rf,vdv,method=vh.vimp,nrep=50)our.rf$rfsrc.refit.objour.rf$topvarsindex-numeric(var.rf$modelsize)for(iin1:var.rf$modelsize){index[i]-which(names(dat)==var.rf$topvars[i])}data-dat[,c(1,2,index)]i-sample(1:77,52)train-data[i,]test-data[-i,]mod.brea-coxph(Surv(time,status)~.,data=train)train_data-train[,c(1,2,which(summary(mod.brea)$coefficients[,5]=0.1)+2)]tset_data-test[,c(1,2,which(summary(mod.brea)$coefficients[,5]=0.1)+2)]mod.brea1-coxph(Surv(time,status)~.,data=train_data)summary(mod.brea1)names(coef(mod.brea1))plot(survfit(mod.brea1),xlab=Time,ylab=Proportion,main=CoxModel,conf.int=TRUE,col=c(black,red,red),ylim=c(0.6,1))index0-numeric(length(coef(mod.brea1)))coefficients-coef(mod.brea1)name-gsub(\`,,names(coefficients))for(jin1:length(index0)){index0[j]-which(names(dat)==name[j])}library(survivalROC)riskscore-as.matrix(dat[i,index0])%*%as.matrix(coefficients)y1-survivalROC(Stime=train$time,status=train$status,marker=riskscore,predict.time=1,span=0.25*(nrow(train))^(-0.20))y3-survivalROC(Stime=train$time,status=train$status,marker=riskscore,predict.time=3,span=0.25*(nrow(train))^(-0.20))y5-survivalROC(Stime=train$time,status=train$status,marker=riskscore,predict.time=5,span=0.25*(nrow(train))^(-0.20))a-matrix(data=c(y1,y3,y5,y1$AUC,y3$AUC,y5$AUC),nrow=3,ncol=2);aplot(y1$FP,y1$TP,type=l,xlab=FalsePositiveRate,ylab=TruePositiveRate,main=Time-dependentROCcurve,col=green)lines(y3$FP,y3$TP,col=red,lty=2)lines(y5$FP,y5$TP,col=blue,lty=3)legend(bottomright,bty=n,legend=c(AUCat1year:0.9271,AUCat3years:0.8621,AUCat5years:0.8263),col=c(green,red,blue),lty=c(1,2,3),cex=0.9)abline(0,1)riskscore-as.matrix(dat[-i,index0])%*%as.matrix(coefficients)y1-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict.time=1,span=0.25*(nrow(train))^(-0.20))y3-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict.time=3,span=0.25*(nrow(train))^(-0.20))y5-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict.time=5,span=0.25*(nrow(train))^(-0.20))a-matrix(data=c(y1,y3,y5,y1$AUC,y3$AUC,y5$AUC),nrow=3,ncol=2);aplot(y1$FP,y1$TP,type=l,xlab=FalsePositiveRate,ylab=TruePositiveRate,main=Time-dependentROCcurve,col=green)lines(y3$FP,y3$TP,col=red,lty=2)lines(y5$FP,y5$TP,col=blue,lty=3)legend(bottomright,bty=n,legend=c(AUCat1year:0.8761,AUCat3years:0.7611,AUCat5years:0.7611),col=c(green,red,blue),lty=c(1,2,3),cex=0.9)abline(0,1)a-matrix(0,30,3)for(cin1:30){i-sample(1:77,52)train-data[i,]test-data[-i,]mod.brea-coxph(Surv(time,status)~.,data=train)train_data-train[,c(1,2,which(summary(mod.brea)$coefficients[,5]=0.1)+2)]tset_data-test[,c(1,2,which(summary(mod.brea)$coefficients[,5]=0.1)+2)]mod.brea1-coxph(Surv(time,status)~.,data=train_data)names(coef(mod.brea1))index0-numeric(length(coef(mod.brea1)))coefficients-coef(mod.brea1)name-gsub(\`,,names(coefficients))for(jin1:length(index0)){index0[j]-whic