Classification trees with bivariate linear discrim

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

Classi cationTreeswithBivariateLinearDiscriminantNodeModelsHyunjoongKimWei-YinLohDepartmentofStatisticsDepartmentofStatisticsUniversityofTennesseeUniversityofWisconsinKnoxville,TN37996Madison,WI53706hjkim@utk.eduloh@stat.wisc.eduAbstractWeintroduceaclassi cationtreealgorithmthatcansimultaneouslyreducetreecom-plexity,improveclassprediction,andenhancedatavisualization.Weaccomplishthisby ttingabivariatelineardiscriminantmodeltothedataineachnode.Standardalgorithmscanproducefairlycomplextreestructures,becausetheyemployaverysimplenodemodel,whereintheentirepartitionassociatedwithanodeisassignedtooneclass.Wereducethesizeofourtreesbylettingthediscriminantmodelsabsorbsomeofthetreecomplexity.Beingthemselvesclassi ers,thediscriminantmodelscanalsohelptoimprovepredictionaccuracy.Finally,becausethediscriminantmodelsutilizeonlytwopredictorvariablesatatime,theire ectsareeasilyvisualizedbymeansoftwo-dimensionalplots.Ouralgorithmdoesnotsimply tdiscriminantmodelstotheterminalnodesofaprunedtree,asthisdoesnotreducethesizeofthetree.Instead,discriminantmodelingiscarriedoutinallphasesoftreegrowthandthemisclassi cationcostsofthenodemodelsareexplicitlyusedtoprunethetree.Ouralgorithmisalsodistinctfromthe\linearcombinationsplitalgorithmsthatpartitionthedataspacewitharbitrarilyorientedhyperplanes.Weuseaxis-orthogonalsplitstopreservetheinterpretabilityofthetreestructures.Anextensiveempiricalstudywithrealdatasetsshowsthatingeneralouralgorithmhasbetterpredictionpowerthanmanyothertreeornon-treealgorithms.Keywordsandphrases:Decisiontree,lineardiscriminantanalysis,tree-structuredclassi er1INTRODUCTIONAmajoradvantageofclassi cationtreesisthedirectandintuitivewaytheycanbeinterpreted.Consider,forexample,Figure1whichshowsatreeobtainedusingversion4oftheCARTalgorithm(Breiman,Friedman,OlshenandStone,1984;SteinbergandColla,1997).ItisbasedondatafromastudyonbreastcancerattheUniversityofWisconsin(WolbergandMangasarian,ResearchpartiallysupportedbyU.S.ArmyResearchOcegrantDAAD19-01-1-0586.1UnifCellSize2.5BareNuclei5.5416j5benign1j7malignantUnifCellShape2.5ClumpThickness5.518j1benign0j4malignantUnifCellSize4.5BareNuclei2.5MarginalAdhesion3.510j1benign0j3malignant8j48malignant5j172malignantFigure1:CARTtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatis estheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.1990).Thedataconsistofmeasurementstakenfrom699patientson9predictorvariablestakingintegervaluesbetween1and10.Theresponsevariablerecordswhetherapatient'stumorisbenignormalignant.Thetreeisstraightforwardtointerpretandshowsthat vepredictorvariablesmaybesucienttopredicttheresponse.Figures2and3showtreesconstructedfromthesamedatausingtheCRUISE(KimandLoh,2001)andQUEST(LohandShih,1997)algorithms.BotharelesscomplexthantheCARTtreeandarethuseveneasiertointerpret.Butaretheyequallygoodintermsofpredictionaccuracy?Empiricalevidence(Lim,LohandShih,2000;KimandLoh,2001)indicatesthatthesealgorithmstendtoproducetreeswithcomparableaccuracy.Ifthesethreetreesareindeedequallyaccurate,theQUESTtreemaybepreferredforitssimplicity.Theclasscompositionsareshownbesidetheterminalnodesofthetrees.Forexample,theterminalnodeontheextremeleftoftheCARTtreecontains416benignand5malignantcases.Ifauserwishesto ndouthowthese421casesaredistributedinthespaceofthepredictorvariables,whatisthebestwaytodothisgraphically?Onecouldmakeone-dimensionaldotplotsofthedataforeachvariable,usingadi erentplotsymbolforeachclass.Thiswillrequire9dotplots.Betterstill,wecouldlookattwo-dimensionalplots.Butwhichpredictorvariabletoplotagainstwhich?Ascatterplotmatrixofallpairsofpredictorswillcontain92=81plotsperterminalnode.Theseplotscanbetiresometoexamineifthenumberofpredictorsorthenumberofterminalnodesislarge.Besides,manyoftheplotswillprobablybeuninteresting.Clearly,itisusefultohaveamethodthatcanscreenthroughalltheplotsandshowusonly2BlandChromatin3.5UnifCellShape3.5424j7benignClumpThickness6.5MarginalAdhesion5.514j3benign0j7malignant0j28malignantUnifCellShape1.57j1benign13j195malignantFigure2:CRUISEtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatis estheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.UnifCellShape2.5403j9benignBareNuclei2.5UnifCellSize3.539j2benign5j20malignant11j210malignantFigure3:QUESTtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatis estheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.3BlandChromatin3.5UnifCellShape3.5424j714j3820j196Figure4:Classi cationtreeforthebreastcancerdatausingtheproposedMmethod.Atanintermediatenode,acasegoestotheleftsubnodeifitsatis estheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcases.theinterestingones.Wewillsaythataplotis\interes

1 / 20
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功