A tutorial on Principal Components Analysis


Lindsay I Smith
February 26, 2002

Chapter 1 Introduction

This tutorial is designed to give the reader an understanding of Principal Components Analysis (PCA). PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.

Before getting to a description of PCA, this tutorial first introduces mathematical concepts that will be used in PCA. It covers standard deviation, covariance, eigenvectors and eigenvalues. This background knowledge is meant to make the PCA section very straightforward, but can be skipped if the concepts are already familiar.

There are examples all the way through this tutorial that are meant to illustrate the concepts being discussed. If further information is required, the mathematics textbook "Elementary Linear Algebra 5e" by Howard Anton, Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6 is a good source of information regarding the mathematical background.

Chapter 2 Background Mathematics

This section will attempt to give some elementary background mathematical skills that will be required to understand the process of Principal Components Analysis. The topics are covered independently of each other, and examples given. It is less important to remember the exact mechanics of a mathematical technique than it is to understand the reason why such a technique may be used, and what the result of the operation tells us about our data. Not all of these techniques are used in PCA, but the ones that are not explicitly required do provide the grounding on which the most important techniques are based.

I have included a section on Statistics which looks at distribution measurements, or, how the data is spread out. The other section is on Matrix Algebra and looks at eigenvectors and eigenvalues, important properties of matrices that are fundamental to PCA.

2.1 Statistics

The entire subject of statistics is based around the idea that you have this big set of data, and you want to analyse that set in terms of the relationships between the individual points in that data set. I am going to look at a few of the measures you can do on a set of data, and what they tell you about the data itself.

2.1.1 Standard Deviation

To understand standard deviation, we need a data set. Statisticians are usually concerned with taking a sample of a population. To use election polls as an example, the population is all the people in the country, whereas a sample is a subset of the population that the statisticians measure. The great thing about statistics is that by only measuring (in this case by doing a phone survey or similar) a sample of the population, you can work out what is most likely to be the measurement if you used the entire population. In this statistics section, I am going to assume that our data sets are samples of some bigger population. There is a reference later in this section pointing to more information about samples and populations.

Here's an example set:

\[ X = [1 \;\; 2 \;\; 4 \;\; 6 \;\; 12 \;\; 15 \;\; 25 \;\; 45 \;\; 68 \;\; 67 \;\; 65 \;\; 98] \]

I could simply use the symbol $X$ to refer to this entire set of numbers. If I want to refer to an individual number in this data set, I will use subscripts on the symbol $X$ to indicate a specific number. E.g. $X_3$ refers to the 3rd number in $X$, namely the number 4. Note that $X_1$ is the first number in the sequence, not $X_0$ like you may see in some textbooks. Also, the symbol $n$ will be used to refer to the number of elements in the set $X$.

There are a number of things that we can calculate about a data set. For example, we can calculate the mean of the sample. I assume that the reader understands what the mean of a sample is, and will only give the formula:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]

Notice the symbol $\bar{X}$ (said "X bar") to indicate the mean of the set $X$. All this formula says is "Add up all the numbers and then divide by how many there are".

Unfortunately, the mean doesn't tell us a lot about the data except for a sort of middle point. For example, these two data sets have exactly the same mean (10), but are obviously quite different:

\[ [0 \;\; 8 \;\; 12 \;\; 20] \quad \text{and} \quad [8 \;\; 9 \;\; 11 \;\; 12] \]

So what is different about these two sets? It is the spread of the data that is different. The Standard Deviation (SD) of a data set is a measure of how spread out the data is.

How do we calculate it? The English definition of the SD is: "The average distance from the mean of the data set to a point". The way to calculate it is to compute the squares of the distance from each data point to the mean of the set, add them all up, divide by $n-1$, and take the positive square root. As a formula:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \]

Where $s$ is the usual symbol for standard deviation of a sample. I hear you asking "Why are you using $(n-1)$ and not $n$?". Well, the answer is a bit complicated, but in general, if your data set is a sample data set, i.e. you have taken a subset of the real world (like surveying 500 people about the election), then you must use $(n-1)$ because it turns out that this gives you an answer that is closer to the standard deviation that would result if you had used the entire population, than if you'd used $n$. If, however, you are not calculating the standard deviation for a sample, but for an entire population, then you should divide by $n$ instead of $(n-1)$. For further reading on this topic, the web page: ... difference between each of the denominators. It also discusses the difference between samples and populations.

So, for our two data sets above, the calculations of standard deviation are in Table 2.1.

Set 1:

    X       (X - \bar{X})    (X - \bar{X})^2
    0       -10              100
    8       -2               4
    12      2                4
    20      10               100
    Total                    208
    Divided by (n-1)         69.333
    Square Root              8.3266

Set 2:

    X       (X - \bar{X})    (X - \bar{X})^2
    8       -2               4
    9       -1               1
    11      1                1
    12      2                4
    Total                    10
    Divided by (n-1)         3.333
    Square Root              1.8257

Table 2.1: Calculation of standard deviation
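
The arithmetic in Table 2.1 can be checked with a few lines of code. What follows is a minimal Python sketch and is not part of the original tutorial: the function names `mean` and `std_dev` and the `sample` flag are illustrative choices of mine. It assumes the two data sets are [0 8 12 20] and [8 9 11 12], as laid out in the rows of Table 2.1, and it divides by (n-1) for a sample or by n for an entire population.

```python
import math


def mean(xs):
    """Add up all the numbers and divide by how many there are."""
    return sum(xs) / len(xs)


def std_dev(xs, sample=True):
    """Standard deviation of xs.

    sample=True  divides by (n-1), for a sample of a bigger population.
    sample=False divides by n, for an entire population.
    """
    n = len(xs)
    x_bar = mean(xs)
    # Sum of squared distances from each data point to the mean.
    total = sum((x - x_bar) ** 2 for x in xs)
    denominator = n - 1 if sample else n
    return math.sqrt(total / denominator)


# The two data sets from Table 2.1 (both have mean 10).
set1 = [0, 8, 12, 20]
set2 = [8, 9, 11, 12]

for name, data in (("Set 1", set1), ("Set 2", set2)):
    print(name,
          "mean:", mean(data),
          "sample SD:", std_dev(data),
          "population SD:", std_dev(data, sample=False))
```

With `sample=True` the results should agree with the "Square Root" rows of Table 2.1, up to rounding in the last printed digit. Passing `sample=False` gives the slightly smaller value you would use if the four numbers were the entire population.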
