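ISSN 1000-1239 / CN 11-1777/TP
Journal of Computer Research and Development 47(8): 1407-1414, 2010

Received: 2009-05-11; Revised: 2009-10-12
Supported by grants (60741001, 60871092, 60932008); (JC200611); (ZJG0705)

A Classification Method for Class Imbalanced Data and Its Application on Bioinformatics

Zou Quan (邹权), Guo Maozu (郭茂祖), Liu Yang (刘扬), and Wang Jun (王峻)
(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
(zouquan@xmu.edu.cn)

Abstract  A classification method is proposed for class-imbalanced data, which is common in bioinformatics tasks such as identifying snoRNA, distinguishing microRNA precursors from pseudo ones, and mining SNPs from EST sequences. The method is based on the main idea of ensemble learning. First, the big class set is randomly divided into several equal subsets, such that every subset together with the small class set makes up a class-balanced training set. Then several classifiers based on different mechanisms are selected and trained on these balanced training sets. Once the classifiers are built, they vote on the final prediction for new samples. In the training phase, a strategy similar to AdaBoost is used: the samples misclassified by a classifier are added to the training sets of the next two classifiers. The training sets are modified repeatedly until a classifier accurately predicts its training set or the maximum number of repetitions is reached. This strategy improves the performance of weak classifiers in the vote. Experiments on five UCI data sets and the three bioinformatics tasks mentioned above demonstrate the performance of the method. Furthermore, a software program named LibID, which can be used in much the same way as LibSVM, has been developed for researchers in bioinformatics and other fields.

Keywords  bioinformatics; class imbalance; ncRNA identification; mining SNP from EST; classification

Abstract (translated from the Chinese)  A classification method for data with imbalanced positive and negative examples is proposed to address bioinformatics problems such as snoRNA identification, microRNA precursor discrimination, and identification of true versus false SNP sites. Following the idea of ensemble learning, the negative set is split evenly and each part is combined in turn with the positive set, giving a group of class-balanced training sets. Each training set is then used to train a classifier based on a different principle, and the test sample is finally decided by vote. To keep weak classifiers from degrading the vote, the idea of AdaBoost is incorporated: the samples misclassified during the training of each classifier are added to the training sets of the next 2 classifiers. This both avoids the repeated training of AdaBoost and uses the voting mechanism to curb the influence of weak classifiers. Experiments on 5 UCI test data sets and 3 bioinformatics data sets demonstrate its advantage on class-imbalanced classification problems.

CLC number  TP180

[…] [1]. […] [2][3][4]. […] RNA [5], microRNA [6]. […] SNP [7], microArray [8].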
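The scheme summarized in the abstract (split the big class into balanced subsets, train one classifier per subset, then vote) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the LibID implementation: the two base learners (`train_centroid`, `train_stump`) are hypothetical stand-ins for the "different mechanism" classifiers, all function names are invented for this sketch, learners are reused cyclically when there are more subsets than learners, and a tied vote is resolved as +1, which the text does not specify.

```python
import random
from typing import Callable, List, Sequence

Sample = Sequence[float]
Classifier = Callable[[Sample], int]  # returns +1 (positive) or -1 (negative)


def split_majority(negatives: List[Sample], n_pos: int, seed: int = 0) -> List[List[Sample]]:
    """Randomly split the big class into num = floor(|N| / |P|) near-equal subsets."""
    num = max(1, len(negatives) // n_pos)
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num] for i in range(num)]


def train_centroid(pos: List[Sample], neg: List[Sample]) -> Classifier:
    """Stand-in learner 1 (assumption): assign x to the nearer class centroid."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    cp, cn = mean(pos), mean(neg)

    def predict(x: Sample) -> int:
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return 1 if dp <= dn else -1
    return predict


def train_stump(pos: List[Sample], neg: List[Sample]) -> Classifier:
    """Stand-in learner 2 (assumption): threshold on feature 0 at the midpoint of class means."""
    mp = sum(r[0] for r in pos) / len(pos)
    mn = sum(r[0] for r in neg) / len(neg)
    thr = (mp + mn) / 2
    sign = 1 if mp >= mn else -1
    return lambda x: sign if x[0] >= thr else -sign


def train_ensemble(pos, neg, learners, seed: int = 0) -> List[Classifier]:
    """Train one classifier per balanced set T_i = P + N_i, cycling through the learners."""
    subsets = split_majority(neg, len(pos), seed)
    return [learners[i % len(learners)](pos, sub) for i, sub in enumerate(subsets)]


def vote(classifiers: List[Classifier], x: Sample) -> int:
    """F(x) = sgn(sum_i C_i(x)); a tie is resolved as +1 (assumption)."""
    return 1 if sum(c(x) for c in classifiers) >= 0 else -1


if __name__ == "__main__":
    # Toy data: 5 positives near (2, 2), 40 negatives in the opposite quadrant.
    P = [(2 + 0.1 * i, 2 - 0.1 * i) for i in range(5)]
    N = [(-2 - 0.05 * i, -2 + 0.05 * i) for i in range(40)]
    ensemble = train_ensemble(P, N, [train_centroid, train_stump])
    print(len(ensemble), vote(ensemble, (2.0, 2.0)), vote(ensemble, (-2.0, -2.0)))
```

Every classifier sees all of the small class but only a slice of the big class, so no single training set is imbalanced while no big-class sample is discarded. The AdaBoost-like exchange of misclassified samples between neighboring training sets is deliberately left out of this sketch.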
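[…] oversampling and undersampling […]. SMOTE [9] […] [10]. […] Boosting [11][12], one-class learning [13][14][15] […]. Boosting […] UCI […]. […] support vector machine (SVM) […] LibSVM [16-17]. […] LibSVM […].

1 […]

1.1 […]

Algorithm 1.
Input: P, N ($|P| \ll |N|$);
Output: F(x), the prediction for x.
① $num \leftarrow \lfloor |N| / |P| \rfloor$;
② randomly divide N into num subsets $N_i$;
③ for $i \in \{1, 2, \dots, num\}$
④   $T_i \leftarrow P + N_i$;
⑤   train classifier $C_i$ on $T_i$, where $C_i(x)$ is the prediction for x (1 or -1);
⑥ end for
⑦ $F(x) = \mathrm{sgn}\left(\sum_{i=1}^{num} C_i(x)\right)$.

[…] Krogh et al. […] [18]. […] the 38 classifiers of the Waikato Environment for Knowledge Analysis (WEKA) 3.5 [19] […] k […]. If num ≤ 38, num of the 38 classifiers are used; if num > 38, all 38 are used and the i-th training set is assigned classifier i % 38. […] AdaBoost […].

1.2 […] AdaBoost

[…] In Algorithm 1, testing classifier $C_i$ on its own training set $T_i$ gives the set of misclassified samples $M_i$. In AdaBoost, $M_i$ […] $T_i$ […] $C'_i$ […] $M'_i$ […]. Here, instead, $M_i$ is added to $T_{i+1}$ and $T_{i+2}$. […] $M_{ij}$ […]. For the last classifier (i = num), $M_{num}$ is added to $T_1$ and $T_2$ […].

Algorithm 2 replaces steps ③ to ⑥ of Algorithm 1.

Algorithm 2.
① $i \leftarrow 0$;
② $time \leftarrow 0$;
③ repeat
④   $i \leftarrow (i + 1)\ \%\ num$;
⑤   $T_i \leftarrow P + N_i$;
⑥   train classifier $C_i$ on $T_i$, where $C_i(x)$ is the prediction for x (1 or -1);
⑦   test $T_i$ with $C_i$ to obtain the misclassified set $M_i$;
⑧   for $j \in \{1, 2, \dots, |M_i|\}$
⑨     $first \leftarrow i$;
⑩     repeat $first \leftarrow (first + 1)\ \%\ num$ until $M_{ij} \notin T_{first}$;
⑪     $T_{first} \leftarrow T_{first} + M_{ij}$;
⑫     $T_{(first+1)\ \%\ num} \leftarrow T_{(first+1)\ \%\ num} + M_{ij}$;
⑬   end for
⑭   if $i = 0$ then $time \leftarrow time + 1$;
⑮ until $time = max\_repeat\_times$ or $M_i, M_{i-1} = \emptyset$.

[…]

Fig. 1  An example for Algorithm 2.

[…] AdaBoost […].

1.3 […]

[…] sensitivity (sn), specificity (sp), overall accuracy (ACC), and Matthews correlation coefficient (MCC). With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, they are defined as:

$$sn = \frac{TP}{TP + FN}; \tag{1}$$

$$sp = \frac{TN}{FP + TN}; \tag{2}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}; \tag{3}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}}. \tag{4}$$

sn is also called recall […] precision […] ACC […]:

$$precision = \frac{TP}{FP + TP}. \tag{5}$$

[…] MCC […] sn and sp […]. When TN is much larger than FP and FN, and FP is in turn much larger than TP, MCC can be approximated as

$$MCC^2 \approx \frac{TP^2}{(TP + FP)(TP + FN)} = \frac{TP}{TP + FN} \times \frac{TN}{TN + FP} \times \frac{TN + FP}{TN} \times \frac{TP}{TP + FP} \approx sn \times sp \times \frac{TP}{TN} \times \frac{TN + FP}{FP} = sn \times sp \times \frac{TP}{TN} \times \frac{1}{1 - sp}. \tag{6}$$

[…] $TP / TN$ […] $sn \times sp \times \frac{1}{1 - sp}$ […] MCC […]. Likewise, when TN and FP are much larger than TP and FN,

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \approx \frac{TN}{TN + FP} = sp. \tag{7}$$

ACC […] sp […] sn and sp […] MCC and ACC […].

2 […]

[…] UCI […]. […] (snoRNA, microRNA) […]. […] (EST, SNP) […].

2.1 UCI […]

The 5 UCI data sets cmc, haberman, ionosphere, letter, and pima […]. The compared methods are AdaBoost, […] (UnderSampl), […] (HSampl), AsymBoost, and BalanceCascade. The results of these 5 methods on the 5 UCI data sets are taken from [14]. […]

Table 1 shows that, except on letter, LibID is the best or close to the best of the compared methods on both measures. […] (cmc, haberman) […] letter […] AdaBoost […]. Comparing LibID(repeat) with LibID(once) on the 5 data sets, repeated training raises recall on 3 (haberman, ionosphere, pima), lowers both measures on 1 (letter), and trades recall for precision on 1 (cmc). […]

Table 1  Performance of 7 Different Classifiers on 5 UCI Data Sets

Data (|P| : |N|)        Classifier       precision  recall
cmc (333 : 1140)        AdaBoost         0.40       0.39
                        UnderSampl       0.33       0.63
                        HSampl           0.37       0.48
                        AsymBoost        0.39       0.42
                        BalanceCascade   0.35       0.59
                        LibID(once)      0.48       0.74
                        LibID(repeat)    0.50       0.67
haberman (81 : 225)     AdaBoost         0.35       0.36
                        UnderSampl       0.36       0.60
                        HSampl           0.36       0.47
                        AsymBoost        0.34       0.39
                        BalanceCascade   0.36       0.57
                        LibID(once)      0.54       0.80
                        LibID(repeat)    0.59       0.84
ionosphere (126 : 225)  AdaBoost         0.95       0.88
                        UnderSampl       0.92       0.89
                        HSampl           0.94       0.86
                        AsymBoost        0.95       0.88
                        BalanceCascade   0.93       0.89
                        LibID(once)      0.94       0.89
                        LibID(repeat)    0.94       0.91
pima (268 : 500)        AdaBoost         0.63       0.60
                        UnderSampl       0.58       0.73
                        HSampl           0.62       0.65
                        AsymBoost        0.63       0.61
                        BalanceCascade   0.60       0.71
                        LibID(once)      0.78       0.76
                        LibID(repeat)    0.77       0.81
letter (789 : 19211)    AdaBoost         0.99       0.98
                        UnderSampl       0.83       0.99
                        HSampl           0.92       0.99
                        AsymBoost        0.99       0.98
                        BalanceCascade   0.96       0.99
                        LibID(once)      0.88       0.99
                        LibID(repeat)    0.85       0.98

Note: Data in this table are average values over 10 runs of 5-fold cross validation.

2.2 snoRNA […]

Small nucleolar RNA (snoRNA) is a class of […] RNA […] ribosomal RNA (rRNA) […]. […] small nuclear RNA (snRNA), transfer RNA (tRNA), and messenger RNA (mRNA) […]. snoRNA is divided into C/D box snoRNA and H/ACA box snoRNA. Jana et al. […] the 2 kinds of snoRNA […] GC […] RNA […] C/D box snoRNA and H/ACA box snoRNA [17]. For C/D box snoRNA, Jana's data contain 306 positive and 45209 negative examples; for H/ACA box snoRNA, 65 positive and 8445 negative examples. […] LibSVM. Table 2 […] [17] […] LibSVM […] 5 […].

Table 2  Performance of LibSVM and Our Method on snoRNA

RNA                  Measurement  LibSVM  LibID
H/ACA box snoRNA     sn           0.78    0.86
                     sp           0.89    0.90
C/D box snoRNA       sn           0.96    0.90
                     sp           0.91    0.94

Table 2 shows that for H/ACA box snoRNA, LibID improves both sn and sp. For C/D box snoRNA, sn is lower than with LibSVM while sp is higher […].

2.3 microRNA […] microRN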