Hierarchical Latent Class Models for Cluster Analysis

Nevin L. Zhang

Technical Report HKUST-CS02-02
Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong

Revised April 2002

This work was completed while the author was on leave at the Department of Computer Science, Aalborg University, Denmark.

Abstract

Latent class models are used for cluster analysis of categorical data. Underlying such a model is the assumption that the observed variables are mutually independent given the class variable. A serious problem with the use of latent class models, known as local dependence, is that this assumption is often untrue. In this paper we propose hierarchical latent class models as a framework where the local dependence problem can be addressed in a principled manner. We develop a search-based algorithm for learning hierarchical latent class models from data. The algorithm is evaluated using both synthetic and real-world data.

Keywords: Model-based clustering, latent class models, local dependence, Bayesian networks, learning.

1 Introduction

Cluster analysis is the partitioning of similar objects into meaningful classes, when both the number of classes and the composition of the classes are to be determined (Kaufman and Rousseeuw 1990; Everitt 1993). In model-based clustering, it is assumed that the objects under study are generated by a mixture of probability distributions, with one component corresponding to each class. When the attributes of objects are continuous, cluster analysis is sometimes called latent profile analysis (Gibson 1959; Lazarsfeld and Henry 1968; Bartholomew and Knott 1999; Vermunt and Magidson 2002). When the attributes are categorical, cluster analysis is sometimes called latent class analysis (LCA) (Lazarsfeld and Henry 1968; Goodman 1974b; Bartholomew and Knott 1999; Uebersax 2001). There is also cluster analysis of mixed-mode data (Everitt 1993) where some attributes are continuous while others are categorical.

This paper is concerned with LCA, where data are assumed to be generated by a latent class (LC) model. An LC model consists of a class variable that represents the clusters to be identified and a number of other variables that represent attributes of objects.¹ The class variable is not observed and hence said to be latent. On the other hand, the attributes are observed and are called manifest variables.

LC models assume local independence, i.e. manifest variables are mutually independent in each latent class, or equivalently, given the latent variable. A serious problem with the use of LCA, known as local dependence, is that this assumption is often violated. If one does not deal with local dependence explicitly, one implicitly attributes it to the latent variable. This can lead to spurious latent classes and poor model fit. It can also degrade the accuracy of classification because locally dependent manifest variables contain overlapping information (Vermunt and Magidson 2002).

The local dependence problem has attracted some attention in the LCA literature (Espeland & Handelman 1989; Garrett & Zeger 2000; Hagenaars 1988; Vermunt & Magidson 2000). Methods for detecting and modeling local dependence have been proposed. To detect local dependence, one typically compares observed and expected cross-classification frequencies for pairs of manifest variables. To model local dependence, one can join manifest variables, introduce multiple latent variables, or reformulate LC models as loglinear models and then impose constraints on them. All existing methods are preliminary proposals and suffer from a number of deficiencies (Section 2).

1.1 Our work

This paper describes the first systematic approach to the problem of local dependence. We address the problem in the framework of hierarchical latent class (HLC) models. HLC models are Bayesian networks whose structures are rooted trees and where the leaf nodes are observed while all other nodes are latent. This class of models is chosen for two reasons. First, it is significantly larger than the class of LC models and can accommodate local dependence. Second, inference in an HLC model takes time linear in model size, which makes it computationally feasible to run EM.

We develop a search-based algorithm for learning HLC models from data. The algorithm systematically searches for the optimal model by hill-climbing in a space of HLC models with the guidance of a model selection criterion. When there is no local dependence, the algorithm returns an LC model. When local dependence is present, it returns an HLC model where local dependence is appropriately modeled. It should be noted, however, that the algorithm might not work well on data generated by models that neither are HLC models nor can be closely approximated by HLC models.

The motivation for this work originates from an application in traditional Chinese medicine. In the application, model quality is of utmost importance and it is reasonable to assume abundant data and computing resources. So we take a principled (as opposed to heuristic) approach when designing our algorithm and we empirically show that the algorithm yields models of good quality. In subsequent work, we will explore ways to scale up the algorithm.

¹ Latent class models are sometimes also referred to as Naive Bayes models. We suggest that the term “naive Bayes models” be used only in the context of classification and the term “latent class models” be used in the context of clustering.

1.2 Related literature

This paper is an addition to the growing literature on hidden variable discovery in Bayesian networks (BN). Here is a brief discussion of some of this literature. Elidan et al. (2001) discuss how to introduce latent variables to BNs constructed for observed variables by BN structure learning algorithms. The idea is to look for structural signatures of latent variables [...]
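
As an illustration of the local independence assumption and of the pairwise check for local dependence described in the introduction, the following is a minimal Python sketch. It is not the algorithm of this report. The two-class model, its probabilities, and the names prior, cpts, joint, expected_pair and pair_discrepancy are all hypothetical, and the Pearson-style statistic is only one common way of comparing observed and expected cross-classification frequencies.

```python
from collections import Counter
from itertools import product

# Hypothetical LC model: two latent classes, three binary manifest variables.
prior = [0.6, 0.4]                      # P(C = c)
cpts = [                                # cpts[i][c][v] = P(X_i = v | C = c)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.7, 0.3], [0.1, 0.9]],
]

def joint(x, prior, cpts):
    """P(x) = sum_c P(c) * prod_i P(x_i | c), i.e. local independence."""
    total = 0.0
    for c, pc in enumerate(prior):
        term = pc
        for i, v in enumerate(x):
            term *= cpts[i][c][v]
        total += term
    return total

def expected_pair(i, j, n, prior, cpts):
    """Expected cross-classification table of X_i and X_j for n samples."""
    k_i, k_j = len(cpts[i][0]), len(cpts[j][0])
    return [[n * sum(pc * cpts[i][c][a] * cpts[j][c][b]
                     for c, pc in enumerate(prior))
             for b in range(k_j)]
            for a in range(k_i)]

def pair_discrepancy(data, i, j, prior, cpts):
    """Sum over cells of (observed - expected)^2 / expected for pair (i, j)."""
    observed = Counter((row[i], row[j]) for row in data)
    expected = expected_pair(i, j, len(data), prior, cpts)
    return sum((observed[(a, b)] - e) ** 2 / e
               for a, row in enumerate(expected)
               for b, e in enumerate(row))

# The joint distribution sums to 1 (up to rounding) over all configurations.
total = sum(joint(x, prior, cpts) for x in product((0, 1), repeat=3))
```

A large value of pair_discrepancy for a pair of manifest variables, relative to the other pairs, suggests that the fitted LC model does not explain their association, which is the symptom of local dependence discussed above.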
