ConceptDecompositionsforLargeSparseTextDatausingClusteringInderjitS.Dhillon(inderjit@cs.utexas.edu)DepartmentofComputerScience,UniversityofTexas,Austin,TX78712,USADharmendraS.Modha(dmodha@almaden.ibm.com)IBMAlmadenResearchCenter,650HarryRoad,SanJose,CA95120,USAOriginalVersion:February14,1999RevisedVersion:October26,2000Abstract.Unlabeleddocumentcollectionsarebecomingincreasinglycommonandavailable;miningsuchdatasetsrepresentsamajorcontemporarychallenge.Usingwordsasfeatures,textdocumentsareoftenrepresentedashigh-dimensionalandsparsevectors–afewthousanddimensionsandasparsityof95to99%istypical.Inthispaper,westudyacertainsphericalk-meansalgorithmforclusteringsuchdocumentvectors.ThealgorithmoutputskdisjointclusterseachwithaconceptvectorthatisthecentroidoftheclusternormalizedtohaveunitEuclideannorm.Asourfirstcontribution,weempiricallydemonstratethat,owingtothehigh-dimensionalityandsparsityofthetextdata,theclustersproducedbythealgorithmhaveacertain“fractal-like”and“self-similar”behavior.Asoursecondcontribution,weintroduceconceptdecompositionstoapproximatethematrixofdocumentvectors;thesedecompositionsareobtainedbytakingtheleast-squaresapproximationontothelinearsubspacespannedbyalltheconceptvectors.Weempiricallyestablishthattheapproximationerrorsoftheconceptdecompositionsareclosetothebestpossible,namely,totruncatedsingularvaluedecompositions.Asourthirdcontribution,weshowthattheconceptvectorsarelocalizedinthewordspace,aresparse,andtendtowardsorthonormality.Incontrast,thesingularvectorsareglobalinthewordspaceandaredense.Nonetheless,weobservethesurprisingfactthatthelinearsubspacesspannedbytheconceptvectorsandtheleadingsingularvectorsarequitecloseinthesenseofsmallprincipalanglesbetweenthem.Inconclusion,theconceptvectorsproducedbythesphericalk-meansalgorithmconstituteapowerfulsparseandlocalized“basis”fortextdatasets.Keywords:conceptvectors,fractals,high-dimensionaldata,informationretrieval,k-meansalgorithm,least-squares,principalangles,principalcomponentanalysis,self-similarity,singularvaluedecomposition,sparsity,vectorspacemodels,textmining1.IntroductionLargesetsoftextdocumentsarenowincreasinglycommon.Forexample,theWorld-Wide-Webcontainsnearly1billionpagesandisgrowingrapidly(),theIBMPatentserverconsistsofmorethan2millionpatents(),theLexis-Nexisdatabasescontainmorethan2billiondocuments().Furthermore,animmenseamountoftextdataexistsonprivatecorporateintranets,inarchivesofmediacompanies,andinscientificandtechnicalpublishinghouses.Inthiscontext,applyingmachinelearningandstatisticalalgorithmssuchasclustering,classification,principalcom-ponentanalysis,anddiscriminantanalysistotextdatasetsisofgreatpracticalinterest.Inthispaper,wefocusonclusteringoftextdatasets.Clusteringhasbeenusedtodiscover“latentconcepts”insetsofunstructuredtextdoc-uments,andtosummarizeandlabelsuchcollections.Clusteringisinherentlyusefulinorganizingandsearchinglargetextcollections,forexample,inautomaticallybuildinganontologylikeYahoo!().Furthermore,clusteringisusefulforcompactlysummarizing,disambiguating,andnavigatingtheresultsretrievedbyasearchenginesuchasAltaVista().Conceptualstructuregeneratedbyclusteringisakinto©2000KluwerAcademicPublishers.PrintedintheNetherlands.final.tex;6/04/2004;23:43;p.12Dhillon&Modhathe“Table-of-Contents”infrontofbooks,whereasaninvertedindexsuchasAltaVistaisakintothe“Indices”atthebackofbooks;bothprovidecomplementaryinformationfornavigatingalargebodyofinformation.Finally,clusteringisusefulforpersonalizedinformationdeliverybyprovidingasetupforroutingnewinformationsuchasthatarriv-ingfromnewsfeedsandnewscientificpublications.Forexperimentsdescribingacertainsyntacticclusteringofthewholewebanditsapplications,see(Broderetal.,1997).Wehaveusedclusteringforvisualizingandnavigatingcollectionsofdocumentsin(Dhillonetal.,1998).Variousclassicalclusteringalgorithmssuchasthek-meansalgorithmanditsvariants,hierarchicalagglomerativeclustering,andgraph-theoreticmethodshavebeenexploredinthetextminingliterature;fordetailedreviews,see(Rasmussen,1992;Willet,1988).Recently,therehasbeenaflurryofactivityinthisarea,see(Boleyetal.,1998;Cut-tingetal.,1992;HearstandPedersen,1996;Sahamietal.,1999;SchützeandSilverstein,1997;SilversteinandPedersen,1997;VaithyanathanandDom,1999;ZamirandEtzioni,1998).Astartingpointforapplyingclusteringalgorithmstounstructuredtextdataistocreateavectorspacemodelfortextdata(SaltonandMcGill,1983).Thebasicideais(a)toextractuniquecontent-bearingwordsfromthesetofdocumentsandtreatthesewordsasfeaturesand(b)torepresenteachdocumentasavectorofcertainweightedwordfrequenciesinthisfeaturespace.Observethatwemayregardthevectorspacemodelofatextdatasetasaword-by-documentmatrixwhoserowsarewordsandcolumnsaredocumentvectors.Typically,alargenumberofwordsexistinevenamoderatelysizedsetofdocuments–afewthousandwordsormorearecommon.Hence,thedocumentvectorsareveryhigh-dimensional.However,typically,mostdocumentscontainmanyfewerwords,1-5%orless,incomparisontothetotalnumberofwordsintheentiredo