Visualizing High-Dimensional Data Using t-SNE


Journal of Machine Learning Research 9 (2008) 2579-2605
Submitted 5/08; Revised 9/08; Published 11/08

Visualizing Data using t-SNE

Laurens van der Maaten (LVDMAATEN@GMAIL.COM)
TiCC, Tilburg University
P.O. Box 90153, 5000 LE Tilburg, The Netherlands

Geoffrey Hinton (HINTON@CS.TORONTO.EDU)
Department of Computer Science, University of Toronto
6 King’s College Road, M5S 3G4 Toronto, ON, Canada

Editor: Yoshua Bengio

© 2008 Laurens van der Maaten and Geoffrey Hinton.

Abstract

We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets.

Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling

1. Introduction

Visualization of high-dimensional data is an important problem in many different domains, and deals with data of widely varying dimensionality. Cell nuclei that are relevant to breast cancer, for example, are described by approximately 30 variables (Street et al., 1993), whereas the pixel intensity vectors used to represent images or the word-count vectors used to represent documents typically have thousands of dimensions. Over the last few decades, a variety of techniques for the visualization of such high-dimensional data have been proposed, many of which are reviewed by de Oliveira and Levkowitz (2003). Important techniques include iconographic displays such as Chernoff faces (Chernoff, 1973), pixel-based techniques (Keim, 2000), and techniques that represent the dimensions in the data as vertices in a graph (Battista et al., 1994). Most of these techniques simply provide tools to display more than two data dimensions, and leave the interpretation of the data to the human observer. This severely limits the applicability of these techniques to real-world data sets that contain thousands of high-dimensional datapoints.

In contrast to the visualization techniques discussed above, dimensionality reduction methods convert the high-dimensional data set $X = \{x_1, x_2, \ldots, x_n\}$ into two or three-dimensional data $Y = \{y_1, y_2, \ldots, y_n\}$ that can be displayed in a scatterplot. In the paper, we refer to the low-dimensional data representation $Y$ as a map, and to the low-dimensional representations $y_i$ of individual datapoints as map points. The aim of dimensionality reduction is to preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional map. Various techniques for this problem have been proposed that differ in the type of structure they preserve. Traditional dimensionality reduction techniques such as Principal Components Analysis (PCA; Hotelling, 1933) and classical multidimensional scaling (MDS; Torgerson, 1952) are linear techniques that focus on keeping the low-dimensional representations of dissimilar datapoints far apart. For high-dimensional data that lies on or near a low-dimensional, non-linear manifold it is usually more important to keep the low-dimensional representations of very similar datapoints close together, which is typically not possible with a linear mapping.

A large number of nonlinear dimensionality reduction techniques that aim to preserve the local structure of data have been proposed, many of which are reviewed by Lee and Verleysen (2007). In particular, we mention the following seven techniques: (1) Sammon mapping (Sammon, 1969), (2) curvilinear components analysis (CCA; Demartines and Hérault, 1997), (3) Stochastic Neighbor Embedding (SNE; Hinton and Roweis, 2002), (4) Isomap (Tenenbaum et al., 2000), (5) Maximum Variance Unfolding (MVU; Weinberger et al., 2004), (6) Locally Linear Embedding (LLE; Roweis and Saul, 2000), and (7) Laplacian Eigenmaps (Belkin and Niyogi, 2002). Despite the strong performance of these techniques on artificial data sets, they are often not very successful at visualizing real, high-dimensional data. In particular, most of the techniques are not capable of retaining both the local and the global structure of the data in a single map. For instance, a recent study reveals that even a semi-supervised variant of MVU is not capable of separating handwritten digits into their natural clusters (Song et al., 2007).

In this paper, we describe a way of converting a high-dimensional data set into a matrix of pairwise similarities and we introduce a new technique, called “t-SNE”, for visualizing the resulting similarity data. t-SNE is capable of capturing much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales. We illustrate the performance of t-SNE by comparing it to the seven dimensionalit
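
The introduction mentions converting a high-dimensional data set into a matrix of pairwise similarities, but this excerpt does not give the formula. As a rough illustration only, the Python sketch below uses the row-normalized Gaussian similarities of Stochastic Neighbor Embedding (Hinton and Roweis, 2002). The function name `pairwise_similarities` and the single fixed bandwidth `sigma` are assumptions made for this example; they are not taken from the text above, and the full method may choose the bandwidth differently for each datapoint.

```python
import math


def pairwise_similarities(points, sigma=1.0):
    """Return a matrix p where p[i][j] is the similarity of point j to point i.

    Assumption for illustration: every point shares one fixed Gaussian
    bandwidth `sigma`. Each row is normalized to sum to 1, and the
    diagonal is left at 0 (a point is not treated as its own neighbor).
    """
    n = len(points)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Squared Euclidean distance between points i and j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            p[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
        row_total = sum(p[i])
        if row_total > 0.0:
            p[i] = [value / row_total for value in p[i]]
    return p


# Three points in a 4-dimensional space: the first two lie close together,
# the third lies far from both.
X = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
P = pairwise_similarities(X, sigma=1.0)
```

In this toy data set, nearly all of the first row's similarity mass falls on the second point, its near neighbor. A map that preserves local structure should place those two points close together.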
