
Fitting Trees to Curve Data, With an Application to Time of Day Patterns

Yan Yu and Diane Lambert

Yan Yu is a Ph.D. student in Statistics at Cornell University. Diane Lambert is in the Statistics Research Department at Bell Labs, Lucent Technologies. The research for this paper was undertaken during the summer of 1997 at Bell Labs. The authors thank Linda Clark and Daryl Pregibon for the multivariate regression tree code and Mark Hansen, David Ruppert, Don Sun, Scott Vander Wiel and all the seminar participants at Bell Labs and Cornell University for valuable suggestions.

Abstract

Decision trees often give simple descriptions of complex, nonlinear relationships between several predictors and a univariate or multivariate response. But if the response is a high dimensional vector that can be thought of as points along a curve, then fitting a multivariate regression tree may be unsuccessful. This paper explores two ways to fit trees to curve data. Both first reduce the dimensionality of the data and then fit a standard multivariate tree to the reduced response. In the first approach, each individual’s response curve is represented as a linear combination of natural spline basis functions, penalizing for roughness, and then a multivariate regression tree is fit to the coefficients of the basis functions. In the second, a multivariate regression tree is fit to the first several principal component scores for the responses. The two methods are illustrated with time of day patterns for telephone customers who place international calls.

Key Words: Functional Response; Multivariate Regression Tree; Principal Component Analysis; Roughness Penalty; Smoothing; Splines.

1 Background

The problem we face is to predict a customer’s time of day pattern for international calling from the information in the customer’s first two international calls. For example, business customers on the east coast who call France in their first two calls may be more likely to call only countries in Western Europe and only during business hours, so they may tend to place calls early in the business day, while business customers on the east coast who call only Japan might not tend to place calls early in the business day because of the difference in time zones. Some information in the first two calls, such as duration, is continuous, and some, such as country called, is categorical. The relationship between the first two calls and the subsequent time of day distribution might be complex, but the rule for predicting the time of day distribution from a customer’s first two international calls needs to be simple enough to be understood by nontechnical people.

Because decision trees partition the values of the predictors (information in the first two calls in our case) into a limited number of sets, each of which corresponds to a different mean response (mean time of day distribution for us), they are a natural tool for generating simple prediction rules. Breiman et al. (1984) is the classic reference for building trees for a univariate response. Clark and Pregibon (1992) and Venables and Ripley (1994) describe fitting trees to univariate responses in the statistical language S (Chambers and Hastie 1992). Segal (1994) applies regression trees to longitudinal data. Linda Clark of Bell Labs and Daryl Pregibon of AT&T Labs have written S functions for fitting regression trees with a multivariate response; we used their functions to fit the multivariate regression trees in our paper.

This paper gives an example in which naively applying multivariate decision trees to long vector responses is not successful. Two procedures that reduce the dimension of the response and then fit a tree to a lower dimensional response are presented. One approach represents each respondent’s time of day distribution as a linear combination of spline basis functions and then fits a multivariate tree to the estimated coefficient vectors. The other approach uses the first several principal component scores as the response vector. Both approaches give sensible results in our application.

The remainder of the paper is organized as follows. Section 2 describes the data. Section 3 shows that fitting a standard multivariate tree to the raw data gives a poor fit and decision rules that are not sensible. Section 4 then develops a spline tree that treats the responses as curves rather than as vectors, and Section 5 fits the proposed spline tree to our data. The decision rules based on the fitted tree are sensible, and bootstrapping can be used to assess the stability of the predictions (Section 6). Section 7 shows that fitting a tree to the first several principal component scores also gives sensible decision rules. The fits of the spline and principal components trees are similar; the spline tree may be preferable when a smooth curve is desired for prediction. Section 8 discusses some alternative ways to build trees for curve data.

2 The Data

Our data consist of the records for completed international calls for 1705 businesses and residences on the east coast of the United States. Each call record includes the caller’s telephone number, start time of the call, called (termination) number, and duration in minutes. Here we use termination number to define a variable called "region of the world" that has eight categories that are intended to capture gross differences in time zones and communities of interest. The categories are Africa, Asia, Eastern Europe, India, Middle East, South America, Western Europe, and "other," which mainly covers small islands and calls to ships. Call records also include information that can be used to distinguish large businesses from residences, which gives a binary "business/other" predictor. A customer in the business category is known to be a business, but a customer in the other category could be either a residence or a small business. The goal is to predict the time of day pattern for a customer’s start of calls. To remove the effect of
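
As an illustration of the idea in Section 1 that a tree partitions predictor values into sets, each with its own mean response, the following minimal Python sketch finds one binary split on a continuous predictor when the response is a vector. It is not the S code of Clark and Pregibon. The split criterion (total within-node sum of squared deviations from the node mean vector) is an assumption, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Componentwise mean of the response vectors in one node."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def within_ss(rows: Sequence[Vector]) -> float:
    """Sum of squared deviations of each response vector from the node mean."""
    if not rows:
        return 0.0
    mu = mean_vector(rows)
    return sum((y - m) ** 2 for row in rows for y, m in zip(row, mu))


def best_split(x: Sequence[float], y: Sequence[Vector]) -> Tuple[float, float]:
    """Return (threshold, total within-node SS) for the best rule x <= threshold.

    Assumed criterion: minimize the summed within-node sum of squares of the
    two child nodes. Candidate thresholds are midpoints between adjacent
    distinct predictor values.
    """
    values = sorted(set(x))
    best = (float("nan"), float("inf"))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        ss = within_ss(left) + within_ss(right)
        if ss < best[1]:
            best = (t, ss)
    return best


# Hypothetical toy data: call duration in minutes as the predictor, and a
# two-component response (share of calls in business hours, share outside).
durations = [1.0, 2.0, 3.0, 10.0, 12.0, 15.0]
responses = [
    [0.8, 0.2], [0.7, 0.3], [0.9, 0.1],
    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7],
]
threshold, ss = best_split(durations, responses)
print(threshold, round(ss, 4))
```

In the paper's setting the response vector would be the reduced representation (spline coefficients or principal component scores) rather than the two shares used here; a full tree would apply such a split recursively and also handle categorical predictors such as region.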
