Web Data Mining, 11: Link Analysis


1LinkAnalysisRoadmap2IntroductionSocialnetworkanalysisPageRankHITSSummaryIntroduction3Earlysearchenginesmainlycomparecontentsimilarityofthequeryandtheindexedpages.I.e.,Theyuseinformationretrievalmethods,cosine,TF-IDF,...From1996,itbecameclearthatcontentsimilarityalonewasnolongersufficient.Thenumberofpagesgrewrapidlyinthemid-late1990’s.Try“classificationtechnique”,Googleestimates:10millionrelevantpages.Howtochooseonly30-40pagesandrankthemsuitablytopresenttotheuser?Contentsimilarityiseasilyspammed.Apageownercanrepeatsomewordsandaddmanyrelatedwordstoboosttherankingsofhispagesand/ortomakethepagesrelevanttoalargenumberofqueries.Introduction(cont…)4Startingaround1996,researchersbegantoworkontheproblem.Theyresorttohyperlinks.InFeb,1997,YanhongLi(ScotchPlains,NJ)filedahyperlinkbasedsearchpatent.Themethoduseswordsinanchortextofhyperlinks.Webpagesontheotherhandareconnectedthroughhyperlinks,whichcarryimportantinformation.Somehyperlinks:organizeinformationatthesamesite.Otherhyperlinks:pointtopagesfromotherWebsites.Suchout-goinghyperlinksoftenindicateanimplicitconveyanceofauthoritytothepagesbeingpointedto.Thosepagesthatarepointedtobymanyotherpagesarelikelytocontainauthoritativeinformation.Introduction(cont…)5During1997-1998,twomostinfluentialhyperlinkbasedsearchalgorithmsPageRankandHITSwerereported.Bothalgorithmsarerelatedtosocialnetworks.TheyexploitthehyperlinksoftheWebtorankpagesaccordingtotheirlevelsof“prestige”or“authority”.HITS:JonKleinberg(CornelUniversity),atNinthAnnualACM-SIAMSymposiumonDiscreteAlgorithms,January1998PageRank:SergeyBrinandLarryPage,PhDstudentsfromStanfordUniversity,atSeventhInternationalWorldWideWebConference()inApril,1998.PageRankpowerstheGooglesearchengine.“StanfordUniversity”thegreat!Google:SergeyBrinandLarryPage(PhDcandidatesinCS)Yahoo!:JerryYangandDavidFilo(PhDcandidatesinEE)HP,Sun,Cisco,…Introduction(cont…)Apartfromsearchranking,hyperlinksarealsousefulforfindingWebcommunities.AWebcommunityisaclusterofdenselylinkedpagesr
epresentingagroupofpeoplewithaspecialinterest.BeyondexplicithyperlinksontheWeb,linksinothercontextsareusefultoo,e.g.,fordiscoveringcommunitiesofnamedentities(e.g.,peopleandorganizations)infreetextdocuments,andforanalyzingsocialphenomenainemails..6Roadmap7IntroductionSocialnetworkanalysisPageRankHITSSummarySocialnetworkanalysis8Socialnetworkisthestudyofsocialentities(peopleinanorganization,calledactors),andtheirinteractionsandrelationships.Theinteractionsandrelationshipscanberepresentedwithanetworkorgraph,eachvertex(ornode)representsanactorandeachlinkrepresentsarelationship.Fromthenetwork,wecanstudythepropertiesofitsstructure,andtherole,positionandprestigeofeachsocialactor.Wecanalsofindvariouskindsofsub-graphs,e.g.,communitiesformedbygroupsofactors.SocialnetworkandtheWeb9SocialnetworkanalysisisusefulfortheWebbecausetheWebisessentiallyavirtualsociety,andthusavirtualsocialnetwork,Eachpage:asocialactorandeachhyperlink:arelationship.ManyresultsfromsocialnetworkcanbeadaptedandextendedforuseintheWebcontext.Westudytwotypesofsocialnetworkanalysis,centralityandprestige,whicharecloselyrelatedtohyperlinkanalysisandsearchontheWeb.Centrality10Importantorprominentactorsarethosethatarelinkedorinvolvedwithotheractorsextensively.Apersonwithextensivecontacts(links)orcommunicationswithmanyotherpeopleintheorganizationisconsideredmoreimportantthanapersonwithrelativelyfewercontacts.Thelinkscanalsobecalledties.Acentralactorisoneinvolvedinmanyties.Anexampleofsocialnetwork11中心参与者是与其他参与者的链接或者链接数目最多,最活跃的参与者DegreeCentrality度中心性12无向图:在无向图中,参与者i的度中心性(CD(i))就是参与者节点的度(di),被最大度(n-1)归一化处理后得到的值有向图:基于链出链接条件:连通图取值范围:?ClosenessCentrality接近中心性13如果一个参与者能很容易的与其他参与者进行互动,那么它就是中心的。于是可使用最短距离来计算这个数值注意:取值范围?Prestige权威14Prestigeisamorerefinedmeasureofprominenceofanactorthancentrality.Distinguish:tiessent(out-links)andtiesreceived(in-links).Aprestigiousactorisonewhoisobjectofextensivetiesasarecipient.Tocomputetheprestige:weuseonlyin-links.Differencebetweencentralityandprestige:centr
alityfocusesonout-linksprestigefocusesonin-links.Westudythreeprestigemeasures.RankprestigeformsthebasisofmostWebpagelinkanalysisalgorithms,includingPageRankandHITS.Degreeprestige度权威15如果一个参与者有许多链入连接或者说被许多其他参与者所推荐,它就是具有权威的.度量一个参与者的权威可以用它的入度Proximityprestige邻近权威16Thedegreeindexofprestigeofanactorionlyconsiderstheactorsthatareadjacenttoi.Theproximityprestigegeneralizesitbyconsideringboththeactorsdirectlyandindirectlylinkedtoactori.Weconsidereveryactorjthatcanreachi.LetIibethesetofactorsthatcanreachactori.Theproximity邻近性isdefinedasclosenessordistanceofotheractorstoi.邻近权威Letd(j,i)denotethedistancefromactorjtoactori.用平均距离计算邻近权威如果计算能够到达i的参与者的比率,并将它和这些参与者与i之间的平均距离相除,就得到[0,1]的邻近权威17Rankprestige等级权威18Intheprevioustwoprestigemeasures,animportantfactorisnotconsidered,theprominenceofindividualactorswhodothe“voting”Intherealworld,apersonichosenbyanimportantpersonismoreprestigiousthanchosenbyalessimportantperson.Forexample,ifacompanyCEOvotesforapersonismuchmoreimportantthanaworkerv
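
The centrality and prestige measures described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the slides' own code: the graph `adj`, the helper names, the closeness formula (n-1 divided by the sum of shortest distances, a common normalized form) and the division of in-degree by n-1 are all assumptions made for the example. Only the normalization of degree by n-1 and the verbal definition of proximity prestige come from the text.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it (source included, at 0)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def reverse(adj):
    """Graph with every link reversed, so in-links become out-links."""
    rev = {u: [] for u in adj}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    return rev

def degree_centrality(adj, i):
    # Directed case from the text: out-degree normalized by the maximum degree n-1.
    return len(adj[i]) / (len(adj) - 1)

def closeness_centrality(adj, i):
    # Assumed form: (n-1) / sum of shortest distances from i to all other actors.
    dist = bfs_distances(adj, i)
    if len(dist) < len(adj):
        raise ValueError("actor cannot reach every other actor")
    return (len(adj) - 1) / sum(dist.values())

def degree_prestige(adj, i):
    # In-degree, divided by n-1 (the normalization is an assumption of this sketch).
    return len(reverse(adj)[i]) / (len(adj) - 1)

def proximity_prestige(adj, i):
    # Proportion of actors that can reach i, divided by their average distance to i.
    dist = bfs_distances(reverse(adj), i)   # d(j, i) for every j in I_i, plus i itself
    reach = len(dist) - 1                   # |I_i|
    if reach == 0:
        return 0.0
    average_distance = sum(dist.values()) / reach
    return (reach / (len(adj) - 1)) / average_distance

# Hypothetical four-actor network: each key links to the actors in its list.
adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ring = {"a": ["b"], "b": ["c"], "c": ["a"]}

if __name__ == "__main__":
    for actor in adj:
        print(actor, degree_centrality(adj, actor),
              degree_prestige(adj, actor), proximity_prestige(adj, actor))
    print("closeness of a in ring:", closeness_centrality(ring, "a"))
```

In this toy network, actor c receives a link from every other actor, so both its degree prestige and its proximity prestige reach the maximum, while actor d, which nobody links to, has zero prestige even though it sends a link. None of these measures weighs who is doing the linking, which is the gap rank prestige addresses.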
